Add ezpz MoE configurations and JSON override #8

Open
saforem2 wants to merge 2 commits into saforem2:ezpz from samuelwheeler:sww/moe-ezpz

saforem2 (Owner) commented Mar 11, 2026

This pull request introduces a framework for launching DeepSeek V3-style Mixture-of-Experts (MoE) training runs, with a focus on flexible, large-scale configuration via JSON files and environment variables. It adds a new set of configuration files and documentation for various MoE training scenarios, and enhances the configuration registry to support overlaying JSON-based overrides onto base configs. Additionally, it improves robustness in the FSDP gradient division logic to better handle PyTorch version differences.

The most important changes are:

1. MoE Run Configuration and Documentation

  • Added a new directory (moe_runs) containing multiple JSON configuration files for different DeepSeek V3 MoE training scenarios (e.g., 2 nodes, 128 nodes, various batch sizes, activation checkpointing strategies, and compilation options). This enables easy experimentation with different distributed and model configurations.
  • Added a comprehensive README (moe_runs/README.md) explaining how to launch these MoE runs, customize configs, and use the new setup.

2. JSON-Based Config Overrides

  • Enhanced config_registry.py to support loading configuration overrides from a JSON file specified via the TT_CONFIG_JSON environment variable, and overlaying these onto base configs. This allows dynamic, environment-driven configuration without code changes.
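
As a rough illustration of the workflow described in this item, the sketch below writes a small override file, points TT_CONFIG_JSON at it, and reads it back the way a `*_from_json` config might. The override keys (`training.steps`, `compile.enable`) follow field names mentioned elsewhere in this PR; the helper name `load_overrides` and the file name `my_overrides.json` are assumptions made for the example, not the PR's actual code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def load_overrides(env_var: str = "TT_CONFIG_JSON") -> dict[str, Any]:
    """Hypothetical loader: read the JSON file named by `env_var`."""
    path = os.environ.get(env_var, "").strip()
    if not path:
        raise ValueError(f"{env_var} must point to a JSON file.")
    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)
    # Only a top-level JSON object can be overlaid onto a config.
    if not isinstance(overrides, dict):
        raise ValueError(f"Expected a JSON object in {path!r}.")
    return overrides


# Illustrative override file: only the fields listed here would change.
example = {"training": {"steps": 40}, "compile": {"enable": False}}
override_path = Path(tempfile.mkdtemp()) / "my_overrides.json"
override_path.write_text(json.dumps(example, indent=2), encoding="utf-8")

# Point the environment variable at the file, then load it back.
os.environ["TT_CONFIG_JSON"] = str(override_path)
loaded = load_overrides()
print(loaded)
```

Fields that are absent from the JSON file keep the values from the base config, which is why a per-run file can stay small.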

3. Improved FSDP Gradient Division Handling

  • Updated the FSDP gradient division logic in parallelize.py to be more robust against PyTorch version changes by dynamically checking for the presence of relevant methods, rather than relying on specific class imports or locations. This increases compatibility and maintainability.
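
The method-presence check described here can be sketched as below. This is a minimal illustration, not the code in parallelize.py: it uses stand-in classes so it runs without PyTorch, `set_gradient_divide_factor` is the method named in this PR, and the force-sum setter name `set_force_sum_reduction_for_comms` is an assumption about the installed PyTorch build. With a real model, `model.modules()` would be passed in.

```python
from typing import Any, Iterable


def disable_gradient_division(modules: Iterable[Any], force_sum: bool = False) -> int:
    """Set the gradient divide factor to 1.0 on every module exposing the hook.

    Returns the number of modules that were updated.
    """
    updated = 0
    for module in modules:
        set_factor = getattr(module, "set_gradient_divide_factor", None)
        if not callable(set_factor):
            continue  # not FSDP-wrapped, or an API version without this hook
        set_factor(1.0)
        if force_sum:
            # Optional hook (assumed name); call it only when it is available.
            set_force_sum = getattr(module, "set_force_sum_reduction_for_comms", None)
            if callable(set_force_sum):
                set_force_sum(True)
        updated += 1
    return updated


class FakeFSDPModule:
    """Stand-in for an FSDP-wrapped module so the sketch runs without PyTorch."""

    def __init__(self) -> None:
        self.factor = None

    def set_gradient_divide_factor(self, factor: float) -> None:
        self.factor = factor


class PlainModule:
    """Stand-in for a module that is not FSDP-wrapped."""


mods = [FakeFSDPModule(), PlainModule(), FakeFSDPModule()]
count = disable_gradient_division(mods)
print(f"updated {count} of {len(mods)} modules")
```

Checking for the method rather than importing a specific FSDP class means the code does not depend on where that class lives in a given PyTorch release.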

4. Example/Test Configs

  • Added example/test JSON config files (e.g., ezpz_test.json) to demonstrate or validate the new configuration override mechanism.

These changes collectively make it easier to configure, launch, and experiment with large-scale MoE training runs on different hardware setups, while improving the maintainability and flexibility of the codebase.

Summary by Sourcery

Introduce ezpz-based DeepSeek V3 MoE run presets with JSON-driven configuration overrides and more robust runtime behaviors for compilation, vocab sizing, and FSDP gradient handling.

New Features:

  • Add DeepSeek V3 10B/2B EP12 trainer config and ezpz launcher scripts/JSON presets for MoE runs across different node and batch configurations.
  • Support JSON-configured ezpz AGPT and DeepSeek V3 configs via TT_CONFIG_JSON, enabling environment-driven overrides of base trainer configs.
  • Automatically infer and override DeepSeek V3 model vocab size from Hugging Face tokenizer assets when available.
  • Enable selective compilation of only feed-forward modules in DeepSeek V3 models via a dedicated compile component flag.

Bug Fixes:

  • Make FSDP gradient division configuration resilient to PyTorch FSDP API and backend variations in both LLaMA 3 and ezpz AGPT paths.

Enhancements:

  • Extend Trainer dataloader construction with global batch size, training steps, and parallelism metadata.
  • Add a downscaled DeepSeek V3 10B/2B EP12 MoE model spec derived from the existing 16B architecture for ezpz experiments.

Documentation:

  • Add README documenting DeepSeek V3 MoE ezpz runs, launcher usage, tokenizer asset setup, and JSON override workflows.

Tests:

  • Add example JSON configs (including ezpz_test.json and multiple MoE run variants) to validate and illustrate the JSON override mechanism and MoE run presets.

Chores:

  • Add PBS submission script for DeepSeek V3 MoE 128-node ezpz runs.

Copilot AI review requested due to automatic review settings March 11, 2026 19:03

sourcery-ai Bot commented Mar 11, 2026

Reviewer's Guide

Adds ezpz-friendly DeepSeek V3 MoE run configs and launcher scripts, introduces a JSON-based override mechanism for TorchTitan configs, improves DeepSeek V3 model/parallelization behaviors (including selective FFN-only compile and vocab inference), and makes FSDP gradient-division handling more robust across backends and PyTorch versions.

Sequence diagram for JSON-based config override and DeepSeek V3 MoE launch

sequenceDiagram
    actor User
    participant Launcher as launch_deepseek_v3_moe_ep12_sh
    participant Ezpz as ezpz_utils_env
    participant TorchTitan as torchtitan_train_cli
    participant ConfigReg as deepseek_v3_config_registry
    participant JsonFile as tt_config_json_file
    participant Trainer as Trainer
    participant Model as DeepSeekV3Model

    User->>Launcher: invoke with wandb_name and extra_args
    Launcher->>Launcher: set TT_CONFIG_JSON default if unset
    Launcher->>Launcher: resolve HF_ASSETS_PATH
    Launcher->>Ezpz: ezpz_setup_env
    Ezpz-->>Launcher: environment configured

    Launcher->>TorchTitan: python -m torchtitan.experiments.ezpz.train activate deepseek_v3_10b_2b_ep12_from_json

    TorchTitan->>ConfigReg: deepseek_v3_10b_2b_ep12_from_json()
    ConfigReg->>ConfigReg: deepseek_v3_10b_2b_ep12()
    ConfigReg-->>TorchTitan: base Trainer_Config

    TorchTitan->>ConfigReg: _load_json_overrides()
    ConfigReg->>ConfigReg: read TT_CONFIG_JSON_ENV
    ConfigReg->>JsonFile: open and parse JSON
    JsonFile-->>ConfigReg: overrides dict
    ConfigReg-->>TorchTitan: overrides dict

    TorchTitan->>ConfigReg: _apply_config_overrides(cfg, overrides)
    ConfigReg->>ConfigReg: recursively apply overrides to Trainer_Config
    ConfigReg-->>TorchTitan: mutated Trainer_Config

    TorchTitan->>Trainer: __init__(config)
    Trainer->>Trainer: build tokenizer using HF_ASSETS_PATH
    Trainer->>Model: update_from_config(trainer_config)
    Model->>Model: _infer_vocab_size_from_tokenizer_assets(hf_assets_path)
    Model->>Model: if vocab_size differs, override model.vocab_size
    Model-->>Trainer: configured model

    Trainer->>Trainer: parallelize_deepseekv3 (activation_checkpoint, compile, fsdp, moe)
    Trainer-->>User: training run active with JSON overrides

Updated class diagram for Trainer config, DeepSeek V3 model, and JSON override helpers

classDiagram
    class Trainer {
        +Config config
        +__init__(config)
    }

    class Trainer_Config {
        +TrainingConfig training
        +ParallelismConfig parallelism
        +ActivationCheckpointConfig activation_checkpoint
        +CompileConfig compile
        +MetricsProcessor_Config metrics
        +OptimizersContainer_Config optimizer
        +LRSchedulersContainer_Config lr_scheduler
        +CheckpointManager_Config checkpoint
        +str hf_assets_path
        +Any dataloader
        +Any debug
    }

    class TrainingConfig {
        +int local_batch_size
        +int global_batch_size
        +int seq_len
        +int steps
    }

    class ParallelismConfig {
        +int data_parallel_replicate_degree
        +int data_parallel_shard_degree
        +int tensor_parallel_degree
        +int pipeline_parallel_degree
        +int context_parallel_degree
        +int expert_parallel_degree
        +int expert_tensor_parallel_degree
        +str pipeline_parallel_schedule
    }

    class CompileConfig {
        +bool enable
        +str backend
        +list~str~ components
    }

    class DeepSeekV3Model {
        +int vocab_size
        +Any layers
        +Any rope
        +update_from_config(trainer_config)
    }

    class DeepSeekV3Model_helpers {
        +_infer_vocab_size_from_tokenizer_assets(hf_assets_path) int_or_None
        +apply_compile_feed_forward_only(model, compile_config) void
    }

    class DeepSeekV3_ConfigRegistry {
        +deepseek_v3_10b_2b_ep12() Trainer_Config
        +deepseek_v3_10b_2b_ep12_from_json() Trainer_Config
        +_deepseek_v3_10b_2b_model_spec() Any
        +_load_json_overrides() dict
        +_apply_config_overrides(target, overrides, path) void
    }

    class Ezpz_AGPT_ConfigRegistry {
        +_base_config(flavor) FaultTolerantTrainer_Config
        +_config_from_json(flavor) FaultTolerantTrainer_Config
        +ezpz_agpt_debugmodel_from_json() FaultTolerantTrainer_Config
        +ezpz_agpt_2b_from_json() FaultTolerantTrainer_Config
        +ezpz_agpt_7b_from_json() FaultTolerantTrainer_Config
        +ezpz_agpt_8b_from_json() FaultTolerantTrainer_Config
        +_load_json_overrides() dict
        +_apply_config_overrides(target, overrides, path) void
    }

    class FaultTolerantTrainer_Config {
        +Any training
        +Any parallelism
        +Any dataloader
        +Any optimizer
        +Any checkpoint
        +Any compile
    }

    Trainer --> Trainer_Config : uses
    Trainer_Config --> TrainingConfig : has
    Trainer_Config --> ParallelismConfig : has
    Trainer_Config --> CompileConfig : has

    Trainer --> DeepSeekV3Model : builds
    DeepSeekV3Model ..> DeepSeekV3Model_helpers : calls

    DeepSeekV3_ConfigRegistry --> Trainer_Config : creates
    Ezpz_AGPT_ConfigRegistry --> FaultTolerantTrainer_Config : creates

    DeepSeekV3_ConfigRegistry ..> DeepSeekV3Model : configures model_spec

File-Level Changes

Change: Introduce JSON-driven config override support for DeepSeek V3 and ezpz AGPT trainers, plus a new 10B/2B EP12 DeepSeek V3 base config.
Details:
  • Add TT_CONFIG_JSON_ENV, _load_json_overrides, and _apply_config_overrides helpers to load and recursively apply JSON overrides onto dataclass config objects with validation of fields and shapes.
  • Define a DeepSeek V3 10B/2B MoE model spec derived from the 16B config and register deepseek_v3_10b_2b_ep12 and *_from_json trainer configs using BlendCorpus dataloader and EP12 parallelism settings.
  • Extend ezpz AGPT config registry with *_from_json variants that construct base configs by flavor and then overlay JSON-driven overrides.
Files:
  • torchtitan/models/deepseek_v3/config_registry.py
  • torchtitan/experiments/ezpz/agpt/config_registry.py

Change: Enhance DeepSeek V3 parallelization and model behavior for MoE runs (FFN-only compile and tokenizer-driven vocab sizing).
Details:
  • Add apply_compile_feed_forward_only to selectively torch.compile dense feed-forward and shared-expert FFN blocks while leaving attention and routing paths in eager mode, and wire it into parallelize_deepseekv3 via a new compile component flag.
  • Extend Trainer initialization to pass global_batch_size, steps, and parallel_dims into model.update_from_config context.
  • Add tokenizer-asset-based vocab size inference in DeepSeek V3 model.update_from_config, overriding self.vocab_size when tokenizer.json is present and disagrees.
Files:
  • torchtitan/models/deepseek_v3/parallelize.py
  • torchtitan/trainer.py
  • torchtitan/models/deepseek_v3/model.py

Change: Make FSDP gradient division configuration robust across PyTorch releases and non-NCCL backends.
Details:
  • Update llama3.disable_fsdp_gradient_division to dynamically detect FSDP-like modules via presence of set_gradient_divide_factor, optionally enforce force-sum reduction on non-NCCL backends, and log how many modules were updated.
  • Align ezpz AGPT disable_fsdp_gradient_division with the same dynamic detection and backend handling, using ezpz.dist.get_torch_backend when available and falling back to torch.distributed.get_backend, plus logging.
Files:
  • torchtitan/models/llama3/parallelize.py
  • torchtitan/experiments/ezpz/agpt/parallelize.py

Change: Add ezpz MoE experiment scaffolding: launcher script, README, and multiple DeepSeek V3 MoE JSON configs for different scales and settings.
Details:
  • Introduce ezpz launcher script for DeepSeek V3 10B/2B EP12 MoE runs that sets up environment (via ezpz utils), discovers HF assets, wires TT_CONFIG_JSON, WandB metadata, and invokes the deepseek_v3_10b_2b_ep12_from_json config.
  • Document the MoE run setup, available JSON configs, and usage patterns in moe_runs/README.md, including how TT_CONFIG_JSON overlays onto the base config and how HF_ASSETS_PATH is resolved.
  • Add a matrix of JSON override files covering smoke tests, 2-node production simulations, 128-node simulations, different seq lens, activation checkpoint and compile strategies, and an example ezpz_test.json plus a PBS submission script for 128-node runs.
Files:
  • torchtitan/experiments/ezpz/moe_runs/launch_deepseek_v3_moe_ep12.sh
  • torchtitan/experiments/ezpz/moe_runs/README.md
  • configs/ezpz_test.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_128nodes_4096_prod_sim_ac_lb2_compile_ffn.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_perf.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_full_lb3.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_layer1_lb3.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb2_compile_ffn.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb3.json
  • torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_smoke.json
  • torchtitan/experiments/ezpz/moe_runs/submit_deepseek_v3_moe_ep12_128n_6h.pbs

sourcery-ai Bot left a comment

Hey - I've found 4 issues, and left some high level feedback:

  • The JSON override helpers (_load_json_overrides / _apply_config_overrides) are duplicated between deepseek_v3 and ezpz/agpt config registries; consider factoring these into a shared utility to avoid divergence and keep the error semantics consistent.
  • The tokenizer-based vocab size inference in _infer_vocab_size_from_tokenizer_assets assumes a specific tokenizer.json structure and successful JSON parse; adding a small try/except and a more defensive fallback path would make this more robust to HF tokenizer format changes or partial assets.
  • The FSDP gradient-division logic (backend detection and set_* method checks) is now very similar between llama3.parallelize and ezpz/agpt.parallelize; pulling this into a shared helper would reduce duplication and help keep behavior aligned across models as it evolves.
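
One way the more defensive vocab-size inference suggested above could look is sketched below. It is an illustration under stated assumptions, not the PR's `_infer_vocab_size_from_tokenizer_assets`: the function name `infer_vocab_size` is hypothetical, and it assumes the common Hugging Face `tokenizer.json` layout with a `model.vocab` mapping and an `added_tokens` list, taking the vocab size to be the largest token id plus one.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def infer_vocab_size(hf_assets_path: str) -> Optional[int]:
    """Return max token id + 1 from tokenizer.json, or None if unavailable."""
    tokenizer_file = Path(hf_assets_path) / "tokenizer.json"
    try:
        data = json.loads(tokenizer_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # missing, unreadable, or malformed file
    if not isinstance(data, dict):
        return None

    ids: list[int] = []
    model = data.get("model")
    vocab = model.get("vocab") if isinstance(model, dict) else None
    if isinstance(vocab, dict):
        ids.extend(v for v in vocab.values() if isinstance(v, int))
    added = data.get("added_tokens")
    if isinstance(added, list):
        ids.extend(
            t["id"] for t in added if isinstance(t, dict) and isinstance(t.get("id"), int)
        )
    # No usable ids means the caller should keep its configured vocab size.
    return max(ids) + 1 if ids else None


# Small demonstration with a synthetic tokenizer.json.
assets_dir = Path(tempfile.mkdtemp())
(assets_dir / "tokenizer.json").write_text(
    json.dumps({"model": {"vocab": {"a": 0, "b": 1}}, "added_tokens": [{"id": 5, "content": "<x>"}]}),
    encoding="utf-8",
)
vocab_size = infer_vocab_size(str(assets_dir))
print(vocab_size)
```

Returning None on any unexpected shape lets the caller fall back to the configured `vocab_size` instead of failing at startup.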
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The JSON override helpers (`_load_json_overrides` / `_apply_config_overrides`) are duplicated between `deepseek_v3` and `ezpz/agpt` config registries; consider factoring these into a shared utility to avoid divergence and keep the error semantics consistent.
- The tokenizer-based vocab size inference in `_infer_vocab_size_from_tokenizer_assets` assumes a specific `tokenizer.json` structure and successful JSON parse; adding a small try/except and a more defensive fallback path would make this more robust to HF tokenizer format changes or partial assets.
- The FSDP gradient-division logic (backend detection and `set_*` method checks) is now very similar between `llama3.parallelize` and `ezpz/agpt.parallelize`; pulling this into a shared helper would reduce duplication and help keep behavior aligned across models as it evolves.

## Individual Comments

### Comment 1
<location path="torchtitan/experiments/ezpz/agpt/config_registry.py" line_range="57-66" />
<code_context>
+def _load_json_overrides() -> dict[str, Any]:
</code_context>
<issue_to_address>
**suggestion:** The JSON override helpers are duplicated across modules and could be centralized to avoid divergence.

Since this logic is now duplicated with `torchtitan/models/deepseek_v3/config_registry.py`, please extract these helpers into a shared utility (e.g., a `config_overrides` module) and reuse them from both locations to keep future behavior changes consistent.

Suggested implementation:

```python
from torchtitan.config_overrides import load_json_overrides as _load_json_overrides

```

To fully implement the centralization you requested, you will also need to:

1. Create a shared helper module, e.g. `torchtitan/config_overrides.py`, with something like:
   ```python
   import json
   import os
   from typing import Any, Dict

   TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"

   def load_json_overrides(tt_config_env: str = TT_CONFIG_JSON_ENV) -> Dict[str, Any]:
       path = os.environ.get(tt_config_env, "").strip()
       if not path:
           raise ValueError(
               f"{tt_config_env} must point to a JSON file when using *_from_json configs."
           )

       with open(path, encoding="utf-8") as f:
           overrides = json.load(f)

       if not isinstance(overrides, dict):
           raise TypeError(
               f"JSON overrides loaded from {path!r} must be an object at the top level, "
               f"got {type(overrides).__name__}"
           )

       return overrides
   ```

2. Update `torchtitan/models/deepseek_v3/config_registry.py` to remove its local JSON override helper and reuse the shared one, mirroring how this file now imports it:
   ```python
   from torchtitan.config_overrides import load_json_overrides as _load_json_overrides
   ```
   and delete the duplicated helper implementation there.

3. Ensure that any references in both modules still use `_load_json_overrides()` as before so behavior remains unchanged, while the implementation is now shared and future changes only need to be made in `torchtitan/config_overrides.py`.
</issue_to_address>

### Comment 2
<location path="torchtitan/experiments/ezpz/moe_runs/launch_deepseek_v3_moe_ep12.sh" line_range="20" />
<code_context>
+# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
+# so temporarily disable nounset during environment initialization.
+set +u
+source <(curl -fsSL https://bit.ly/ezpz-utils)
+ezpz_setup_env
+set -u
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Sourcing a remote script on every launch introduces reliability and security risks.

Using `curl -fsSL https://bit.ly/ezpz-utils` at runtime makes each job depend on an unversioned remote script behind a shortlink. Any outage, redirect, or upstream change could break runs or silently alter behavior. Prefer vendoring a copy into the repo or using a stable, versioned URL to improve reproducibility and limit the impact of upstream changes.

Suggested implementation:

```
# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
# so temporarily disable nounset during environment initialization.
set +u

# Use a vendored ezpz-utils script to avoid runtime dependency on a remote shortlink.
EZPZ_UTILS="${REPO_ROOT}/torchtitan/experiments/ezpz/ezpz-utils.sh"
if [[ ! -r "${EZPZ_UTILS}" ]]; then
  echo "ERROR: ezpz utils script not found at: ${EZPZ_UTILS}" >&2
  echo "Please ensure a stable, versioned copy of ezpz-utils.sh is checked into the repo." >&2
  exit 1
fi

# shellcheck source=/dev/null
source "${EZPZ_UTILS}"
ezpz_setup_env
set -u

```

1. Add a vendored copy of the `ezpz-utils` script at:
   `torchtitan/experiments/ezpz/ezpz-utils.sh` (this should be a stable, versioned copy from upstream).
2. Keep this file updated via your normal dependency/version management process, rather than sourcing directly from a remote URL at runtime.
</issue_to_address>

### Comment 3
<location path="torchtitan/experiments/ezpz/moe_runs/README.md" line_range="30" />
<code_context>
+- `steps=40`
+- `attn_backend=sdpa`
+- `compile.enable=false`
+- checkpoint save every 5 steps (`enable_first_step_checkpoint=true`)
+
+## Run
</code_context>
<issue_to_address>
**nitpick (typo):** Consider fixing the grammar in this bullet point.

For example: "checkpoint saved every 5 steps" or "checkpoints saved every 5 steps".

```suggestion
- checkpoint saved every 5 steps (`enable_first_step_checkpoint=true`)
```
</issue_to_address>

### Comment 4
<location path="torchtitan/models/deepseek_v3/config_registry.py" line_range="36" />
<code_context>

 from . import model_registry

+TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"
+

</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the JSON config override logic into a shared helper module and having registry files only call those helpers to keep them declarative and DRY.

You can reduce complexity and duplication by extracting the JSON override logic into a shared helper module and keeping this registry file declarative.

### 1. Move override helpers into a shared module

Create e.g. `torchtitan/config/overrides.py`:

```python
# torchtitan/config/overrides.py
import json
import os
from dataclasses import is_dataclass
from typing import Any

def load_overrides_from_env(env_var: str = "TT_CONFIG_JSON") -> dict[str, Any]:
    path = os.environ.get(env_var, "").strip()
    if not path:
        raise ValueError(
            f"{env_var} must point to a JSON file when using *_from_json configs."
        )

    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)

    if not isinstance(overrides, dict):
        raise ValueError(
            f"Expected top-level JSON object in {path!r}, "
            f"got {type(overrides).__name__}."
        )

    return overrides


def apply_overrides(target: Any, overrides: dict[str, Any], path: str = "") -> None:
    for key, value in overrides.items():
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {key!r} at path {path or '<root>'}.")

        current_value = getattr(target, key)
        field_path = f"{path}.{key}" if path else key

        if isinstance(value, dict):
            if not is_dataclass(current_value):
                raise TypeError(
                    f"Expected dataclass at {field_path!r} for nested override, "
                    f"got {type(current_value).__name__}."
                )
            apply_overrides(current_value, value, field_path)
            continue

        setattr(target, key, value)
```

This keeps all existing behavior (env var name, validation, recursive dataclass handling), but in one place.

### 2. Simplify this registry file to call shared helpers

In the registry file, remove `_load_json_overrides`, `_apply_config_overrides`, and `TT_CONFIG_JSON_ENV`, and import the shared helpers instead:

```python
# in this registry file
from torchtitan.config.overrides import load_overrides_from_env, apply_overrides

def deepseek_v3_10b_2b_ep12_from_json() -> Trainer.Config:
    cfg = deepseek_v3_10b_2b_ep12()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg
```

### 3. Update the other registry using the same helpers

In `experiments/ezpz/agpt/config_registry.py` (where the logic is duplicated), replace the local copies with the same import and usage pattern:

```python
from torchtitan.config.overrides import load_overrides_from_env, apply_overrides

def some_config_from_json() -> Trainer.Config:
    cfg = some_config()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg
```

This removes the recursive override engine from both registry files, avoids copy-paste divergence, and makes future changes to override semantics centralized and easier to maintain, while preserving all current functionality.
</issue_to_address>
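
To make the override semantics in Comment 4 concrete, here is a small usage sketch against toy dataclasses. `ToyTraining` and `ToyConfig` are hypothetical stand-ins for the real trainer config, and `apply_overrides` mirrors the helper suggested above (repeated so the example runs on its own); the error messages differ slightly from the suggestion.

```python
from dataclasses import dataclass, field, is_dataclass
from typing import Any


def apply_overrides(target: Any, overrides: dict[str, Any], path: str = "") -> None:
    """Recursively overlay `overrides` onto a dataclass instance, in place."""
    for key, value in overrides.items():
        field_path = f"{path}.{key}" if path else key
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {field_path!r}.")
        current = getattr(target, key)
        if isinstance(value, dict):
            # Nested JSON objects may only descend into nested dataclasses.
            if not is_dataclass(current):
                raise TypeError(f"Expected dataclass at {field_path!r}.")
            apply_overrides(current, value, field_path)
        else:
            setattr(target, key, value)


@dataclass
class ToyTraining:  # hypothetical stand-in for TrainingConfig
    steps: int = 100
    seq_len: int = 4096


@dataclass
class ToyConfig:  # hypothetical stand-in for the trainer config
    training: ToyTraining = field(default_factory=ToyTraining)
    hf_assets_path: str = "./assets"


cfg = ToyConfig()
apply_overrides(cfg, {"training": {"steps": 40}})
print(cfg.training.steps, cfg.training.seq_len)  # 40 4096

# A misspelled key is rejected instead of being silently ignored.
try:
    apply_overrides(cfg, {"training": {"stepz": 1}})
except KeyError as err:
    print("rejected:", err)
```

Only the overridden field changes; `seq_len` keeps its base value, and a typo in the JSON surfaces as an error at config-build time.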


Comment on lines +57 to +66
def _load_json_overrides() -> dict[str, Any]:
path = os.environ.get(TT_CONFIG_JSON_ENV, "").strip()
if not path:
raise ValueError(
f"{TT_CONFIG_JSON_ENV} must point to a JSON file when using *_from_json configs."
)

with open(path, encoding="utf-8") as f:
overrides = json.load(f)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The JSON override helpers are duplicated across modules and could be centralized to avoid divergence.

Since this logic is now duplicated with torchtitan/models/deepseek_v3/config_registry.py, please extract these helpers into a shared utility (e.g., a config_overrides module) and reuse them from both locations to keep future behavior changes consistent.

Suggested implementation:

from torchtitan.config_overrides import load_json_overrides as _load_json_overrides

To fully implement the centralization you requested, you will also need to:

  1. Create a shared helper module, e.g. torchtitan/config_overrides.py, with something like:

    import json
    import os
    from typing import Any, Dict
    
    TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"
    
    def load_json_overrides(tt_config_env: str = TT_CONFIG_JSON_ENV) -> Dict[str, Any]:
        path = os.environ.get(tt_config_env, "").strip()
        if not path:
            raise ValueError(
                f"{tt_config_env} must point to a JSON file when using *_from_json configs."
            )
    
        with open(path, encoding="utf-8") as f:
            overrides = json.load(f)
    
        if not isinstance(overrides, dict):
            raise TypeError(
                f"JSON overrides loaded from {path!r} must be an object at the top level, "
                f"got {type(overrides).__name__}"
            )
    
        return overrides
  2. Update torchtitan/models/deepseek_v3/config_registry.py to remove its local JSON override helper and reuse the shared one, mirroring how this file now imports it:

    from torchtitan.config_overrides import load_json_overrides as _load_json_overrides

    and delete the duplicated helper implementation there.

  3. Ensure that any references in both modules still use _load_json_overrides() as before so behavior remains unchanged, while the implementation is now shared and future changes only need to be made in torchtitan/config_overrides.py.

# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
# so temporarily disable nounset during environment initialization.
set +u
source <(curl -fsSL https://bit.ly/ezpz-utils)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 suggestion (security): Sourcing a remote script on every launch introduces reliability and security risks.

Using curl -fsSL https://bit.ly/ezpz-utils at runtime makes each job depend on an unversioned remote script behind a shortlink. Any outage, redirect, or upstream change could break runs or silently alter behavior. Prefer vendoring a copy into the repo or using a stable, versioned URL to improve reproducibility and limit the impact of upstream changes.

Suggested implementation:

# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
# so temporarily disable nounset during environment initialization.
set +u

# Use a vendored ezpz-utils script to avoid runtime dependency on a remote shortlink.
EZPZ_UTILS="${REPO_ROOT}/torchtitan/experiments/ezpz/ezpz-utils.sh"
if [[ ! -r "${EZPZ_UTILS}" ]]; then
  echo "ERROR: ezpz utils script not found at: ${EZPZ_UTILS}" >&2
  echo "Please ensure a stable, versioned copy of ezpz-utils.sh is checked into the repo." >&2
  exit 1
fi

# shellcheck source=/dev/null
source "${EZPZ_UTILS}"
ezpz_setup_env
set -u

  1. Add a vendored copy of the ezpz-utils script at:
    torchtitan/experiments/ezpz/ezpz-utils.sh (this should be a stable, versioned copy from upstream).
  2. Keep this file updated via your normal dependency/version management process, rather than sourcing directly from a remote URL at runtime.

- `steps=40`
- `attn_backend=sdpa`
- `compile.enable=false`
- checkpoint save every 5 steps (`enable_first_step_checkpoint=true`)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (typo): Consider fixing the grammar in this bullet point.

For example: "checkpoint saved every 5 steps" or "checkpoints saved every 5 steps".

Suggested change
- checkpoint save every 5 steps (`enable_first_step_checkpoint=true`)
- checkpoint saved every 5 steps (`enable_first_step_checkpoint=true`)


from . import model_registry

TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider extracting the JSON config override logic into a shared helper module and having registry files only call those helpers to keep them declarative and DRY.

You can reduce complexity and duplication by extracting the JSON override logic into a shared helper module and keeping this registry file declarative.

1. Move override helpers into a shared module

Create e.g. torchtitan/config/overrides.py:

# torchtitan/config/overrides.py
import json
import os
from dataclasses import is_dataclass
from typing import Any

def load_overrides_from_env(env_var: str = "TT_CONFIG_JSON") -> dict[str, Any]:
    path = os.environ.get(env_var, "").strip()
    if not path:
        raise ValueError(
            f"{env_var} must point to a JSON file when using *_from_json configs."
        )

    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)

    if not isinstance(overrides, dict):
        raise ValueError(
            f"Expected top-level JSON object in {path!r}, "
            f"got {type(overrides).__name__}."
        )

    return overrides


def apply_overrides(target: Any, overrides: dict[str, Any], path: str = "") -> None:
    for key, value in overrides.items():
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {key!r} at path {path or '<root>'}.")

        current_value = getattr(target, key)
        field_path = f"{path}.{key}" if path else key

        if isinstance(value, dict):
            if not is_dataclass(current_value):
                raise TypeError(
                    f"Expected dataclass at {field_path!r} for nested override, "
                    f"got {type(current_value).__name__}."
                )
            apply_overrides(current_value, value, field_path)
            continue

        setattr(target, key, value)

This keeps all existing behavior (env var name, validation, recursive dataclass handling), but in one place.

2. Simplify this registry file to call shared helpers

In the registry file, remove _load_json_overrides, _apply_config_overrides, and TT_CONFIG_JSON_ENV, and import the shared helpers instead:

# in this registry file
from torchtitan.config.overrides import load_overrides_from_env, apply_overrides

def deepseek_v3_10b_2b_ep12_from_json() -> Trainer.Config:
    cfg = deepseek_v3_10b_2b_ep12()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg

3. Update the other registry using the same helpers

In experiments/ezpz/agpt/config_registry.py (where the logic is duplicated), replace the local copies with the same import and usage pattern:

from torchtitan.config.overrides import load_overrides_from_env, apply_overrides

def some_config_from_json() -> Trainer.Config:
    cfg = some_config()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg

This removes the recursive override engine from both registry files, avoids copy-paste divergence, and makes future changes to override semantics centralized and easier to maintain, while preserving all current functionality.
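As a rough illustration of how the recursive overlay behaves, here is a minimal sketch that applies a nested override dict to a small dataclass tree. `TrainerConfig`, `TrainingConfig`, and their fields and defaults are hypothetical stand-ins (not the real `Trainer.Config`); `apply_overrides` repeats the helper proposed above so the block runs on its own.

```python
from dataclasses import dataclass, field, is_dataclass
from typing import Any


@dataclass
class TrainingConfig:
    # Hypothetical nested section; field names and defaults are assumptions.
    steps: int = 1000
    seq_len: int = 4096


@dataclass
class TrainerConfig:
    # Hypothetical stand-in for Trainer.Config.
    training: TrainingConfig = field(default_factory=TrainingConfig)
    attn_backend: str = "flex"


def apply_overrides(target: Any, overrides: dict[str, Any], path: str = "") -> None:
    """Recursively overlay `overrides` onto a dataclass instance in place."""
    for key, value in overrides.items():
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {key!r} at path {path or '<root>'}.")

        current_value = getattr(target, key)
        field_path = f"{path}.{key}" if path else key

        if isinstance(value, dict):
            # A dict in the JSON means "descend into this nested dataclass".
            if not is_dataclass(current_value):
                raise TypeError(
                    f"Expected dataclass at {field_path!r} for nested override, "
                    f"got {type(current_value).__name__}."
                )
            apply_overrides(current_value, value, field_path)
            continue

        setattr(target, key, value)


cfg = TrainerConfig()
# Equivalent to a JSON file containing:
# {"training": {"steps": 40}, "attn_backend": "sdpa"}
apply_overrides(cfg, {"training": {"steps": 40}, "attn_backend": "sdpa"})
print(cfg)
```

Fields that the override dict does not mention (here `training.seq_len`) keep their base values, and a misspelled key fails fast with a `KeyError` instead of being silently ignored.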

Copilot AI left a comment


Pull request overview

This PR adds an “ezpz” launch/config framework for running DeepSeek V3-style MoE training (including multiple JSON run presets), introduces JSON-based config overlays via TT_CONFIG_JSON, and improves robustness of FSDP gradient-division handling across PyTorch versions.

Changes:

  • Add DeepSeek V3 MoE run presets + launcher/submission scripts under torchtitan/experiments/ezpz/moe_runs/.
  • Add *_from_json config entrypoints that load and apply JSON overrides from TT_CONFIG_JSON.
  • Improve PyTorch-version resilience for disabling FSDP gradient division and add a DeepSeekV3 “compile feed-forward only” path.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| torchtitan/trainer.py | Pass additional runtime kwargs (e.g., global batch size / steps / parallel dims) into dataloader construction. |
| torchtitan/models/llama3/parallelize.py | Make FSDP gradient-division disabling resilient to PyTorch internal class location changes. |
| torchtitan/models/deepseek_v3/parallelize.py | Add optional feed_forward-only compilation path for DeepSeek V3. |
| torchtitan/models/deepseek_v3/model.py | Infer vocab size from tokenizer assets and override model vocab_size accordingly. |
| torchtitan/models/deepseek_v3/config_registry.py | Add JSON override overlay utilities + a new DeepSeek V3 MoE config (10b_2b_ep12) and *_from_json entrypoint. |
| torchtitan/experiments/ezpz/moe_runs/submit_deepseek_v3_moe_ep12_128n_6h.pbs | Add a PBS submission template for 128-node runs. |
| torchtitan/experiments/ezpz/moe_runs/launch_deepseek_v3_moe_ep12.sh | Add a launcher that wires W&B + TT_CONFIG_JSON + assets discovery, then launches training. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_smoke.json | Add smoke-test JSON overrides for a small DeepSeek V3 MoE run. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb3.json | Add a 2-node 4096-seq config preset (LB=3) with selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb2_compile_ffn.json | Add a 2-node 4096-seq preset that enables feed_forward compile + loss compile. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_layer1_lb3.json | Add a 2-node 4096-seq preset using layer-frequency selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_full_lb3.json | Add a 2-node 4096-seq preset using full AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac.json | Add a 2-node 4096-seq preset with selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim.json | Add a 2-node 4096-seq preset with AC disabled. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_perf.json | Add a 2-node 4096-seq throughput/perf-oriented preset. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes.json | Add a baseline 2-node JSON overrides preset. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_128nodes_4096_prod_sim_ac_lb2_compile_ffn.json | Add a large-scale (128-node) JSON overrides preset. |
| torchtitan/experiments/ezpz/moe_runs/README.md | Document how to use the new MoE run launcher and presets. |
| torchtitan/experiments/ezpz/agpt/parallelize.py | Apply the same FSDP gradient-division robustness pattern in the ezpz AGPT parallelization path. |
| torchtitan/experiments/ezpz/agpt/config_registry.py | Add *_from_json config entrypoints for ezpz AGPT via TT_CONFIG_JSON. |
| configs/ezpz_test.json | Add an example JSON override file for testing the override mechanism. |


Comment on lines +61 to +63
if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" && ! -f "${HF_ASSETS_PATH}/tokenizer.model" ]]; then
  echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain tokenizer.json or tokenizer.model"
  echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory."

Copilot AI Mar 11, 2026


The HF assets validation allows a directory with only tokenizer.model, but the default TorchTitan HuggingFaceTokenizer loader does not support SentencePiece .model files (it expects tokenizer.json or vocab/merges files). This check can therefore pass and still fail later when building the tokenizer; consider requiring tokenizer.json (or vocab/merges) here, or switching the run config to use a tokenizer backend that supports tokenizer.model.

Suggested change
if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" && ! -f "${HF_ASSETS_PATH}/tokenizer.model" ]]; then
  echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain tokenizer.json or tokenizer.model"
  echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory."
if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" ]]; then
  echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain the required tokenizer.json file."
  echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory containing tokenizer.json."
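The same pre-flight check can be sketched in Python, for cases where validation happens before the shell launcher runs. The helper name `has_loadable_tokenizer_assets` is hypothetical, and the accepted file names simply follow the review comment's description of what the loader expects. The comment does not say whether `merges.txt` is strictly required alongside a vocab file, so this sketch does not check for it.

```python
import os
import tempfile


def has_loadable_tokenizer_assets(assets_dir: str) -> bool:
    """Return True if assets_dir has tokenizer.json or a vocab file.

    A directory containing only a SentencePiece tokenizer.model is rejected,
    mirroring the stricter check suggested in the review comment.
    """
    def exists(name: str) -> bool:
        return os.path.isfile(os.path.join(assets_dir, name))

    if exists("tokenizer.json"):
        return True
    # Assumption: a vocab.json or vocab.txt file is enough for the fallback path.
    return exists("vocab.json") or exists("vocab.txt")


with tempfile.TemporaryDirectory() as assets_dir:
    # Only a SentencePiece model present: the check should fail.
    open(os.path.join(assets_dir, "tokenizer.model"), "w").close()
    only_sentencepiece = has_loadable_tokenizer_assets(assets_dir)

    # After adding tokenizer.json the check should pass.
    open(os.path.join(assets_dir, "tokenizer.json"), "w").close()
    with_tokenizer_json = has_loadable_tokenizer_assets(assets_dir)

print(only_sentencepiece, with_tokenizer_json)
```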

Comment on lines +33 to +60
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    if not os.path.exists(tokenizer_json_path):
        return None

    with open(tokenizer_json_path, encoding="utf-8") as f:
        tokenizer_data = json.load(f)

    vocab = tokenizer_data.get("model", {}).get("vocab", {})
    added_tokens = tokenizer_data.get("added_tokens", [])

    max_token_id = -1
    if isinstance(vocab, dict):
        for token_id in vocab.values():
            if isinstance(token_id, int):
                max_token_id = max(max_token_id, token_id)

    if isinstance(added_tokens, list):
        for token_info in added_tokens:
            if isinstance(token_info, dict):
                token_id = token_info.get("id")
                if isinstance(token_id, int):
                    max_token_id = max(max_token_id, token_id)

    if max_token_id < 0:
        return None
    return max_token_id + 1


Copilot AI Mar 11, 2026


_infer_vocab_size_from_tokenizer_assets only reads tokenizer.json, but TorchTitan tokenizers can also be loaded from vocab.json/vocab.txt (+ merges.txt). If a valid assets dir lacks tokenizer.json, vocab inference will silently return None and vocab_size will stay at the default, potentially causing a shape mismatch with the tokenizer. Consider reusing the same tokenizer-loading logic as HuggingFaceTokenizer (or otherwise supporting vocab/merges formats) when inferring vocab size.

Suggested change
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    if not os.path.exists(tokenizer_json_path):
        return None
    with open(tokenizer_json_path, encoding="utf-8") as f:
        tokenizer_data = json.load(f)
    vocab = tokenizer_data.get("model", {}).get("vocab", {})
    added_tokens = tokenizer_data.get("added_tokens", [])
    max_token_id = -1
    if isinstance(vocab, dict):
        for token_id in vocab.values():
            if isinstance(token_id, int):
                max_token_id = max(max_token_id, token_id)
    if isinstance(added_tokens, list):
        for token_info in added_tokens:
            if isinstance(token_info, dict):
                token_id = token_info.get("id")
                if isinstance(token_id, int):
                    max_token_id = max(max_token_id, token_id)
    if max_token_id < 0:
        return None
    return max_token_id + 1
    """
    Try to infer the tokenizer vocabulary size from assets stored in ``hf_assets_path``.
    Preference order:
    1. ``tokenizer.json`` (uses max token id + 1, including added tokens)
    2. ``vocab.json`` (length of vocab)
    3. ``vocab.txt`` (number of non-empty, non-comment lines)
    """
    # First try tokenizer.json, which may contain explicit token ids.
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    if os.path.exists(tokenizer_json_path):
        with open(tokenizer_json_path, encoding="utf-8") as f:
            tokenizer_data = json.load(f)
        vocab = tokenizer_data.get("model", {}).get("vocab", {})
        added_tokens = tokenizer_data.get("added_tokens", [])
        max_token_id = -1
        if isinstance(vocab, dict):
            for token_id in vocab.values():
                if isinstance(token_id, int):
                    max_token_id = max(max_token_id, token_id)
        if isinstance(added_tokens, list):
            for token_info in added_tokens:
                if isinstance(token_info, dict):
                    token_id = token_info.get("id")
                    if isinstance(token_id, int):
                        max_token_id = max(max_token_id, token_id)
        if max_token_id >= 0:
            return max_token_id + 1
    # Fallback: try vocab.json, which typically contains a mapping from token to id
    # or a list of tokens. In both cases, the length represents the vocab size.
    vocab_json_path = os.path.join(hf_assets_path, "vocab.json")
    if os.path.exists(vocab_json_path):
        try:
            with open(vocab_json_path, encoding="utf-8") as f:
                vocab_data = json.load(f)
        except Exception:
            vocab_data = None
        if isinstance(vocab_data, dict):
            if len(vocab_data) > 0:
                return len(vocab_data)
        elif isinstance(vocab_data, list):
            if len(vocab_data) > 0:
                return len(vocab_data)
    # Fallback: try vocab.txt (one token per line, possibly with comments).
    vocab_txt_path = os.path.join(hf_assets_path, "vocab.txt")
    if os.path.exists(vocab_txt_path):
        vocab_size = 0
        try:
            with open(vocab_txt_path, encoding="utf-8") as f:
                for line in f:
                    stripped = line.strip()
                    # Skip empty lines and comment lines (commonly start with '#').
                    if not stripped or stripped.startswith("#"):
                        continue
                    vocab_size += 1
        except Exception:
            vocab_size = 0
        if vocab_size > 0:
            return vocab_size
    # If no known assets are found or none yield a valid size, return None.
    return None
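To see why the `tokenizer.json` branch uses "max token id + 1" rather than the number of vocab entries, consider this small worked example. The token strings and ids are made up for illustration, and `infer_vocab_size` is a condensed, hypothetical restatement of that branch rather than the PR's function.

```python
# Synthetic tokenizer.json contents (tokens and ids are invented).
tokenizer_data = {
    "model": {"vocab": {"<pad>": 0, "hello": 1, "world": 2}},
    # An added special token whose id lies beyond the base vocab, leaving a gap.
    "added_tokens": [{"id": 7, "content": "<|eot|>"}],
}


def infer_vocab_size(data: dict):
    """Return max token id + 1 across base vocab and added tokens, or None."""
    ids = [
        token_id
        for token_id in data.get("model", {}).get("vocab", {}).values()
        if isinstance(token_id, int)
    ]
    ids += [
        token["id"]
        for token in data.get("added_tokens", [])
        if isinstance(token, dict) and isinstance(token.get("id"), int)
    ]
    return max(ids) + 1 if ids else None


size_from_max_id = infer_vocab_size(tokenizer_data)    # 8
size_from_len = len(tokenizer_data["model"]["vocab"])  # 3
print(size_from_max_id, size_from_len)
```

An embedding table sized by entry count (3) could not be indexed by token id 7, whereas sizing by the largest id (8 rows) can. This is also why the `vocab.json` and `vocab.txt` fallbacks, which count entries, are only equivalent when ids are dense and start at zero.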

Comment on lines +27 to +29
from torchtitan.experiments.ezpz.blendcorpus.blendcorpus_builder import (
    BlendCorpusDataLoader,
)

Copilot AI Mar 11, 2026


This model-level config registry now imports BlendCorpusDataLoader from torchtitan.experiments.*, creating a dependency from core model code into the experiments tree. Elsewhere under torchtitan/models/*/config_registry.py there are no torchtitan.experiments imports; consider moving this ezpz/BlendCorpus-specific config into an experiments config registry (or keep the model config registry using core dataloaders only) to avoid packaging/layering issues.
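One way such a layering rule could be enforced automatically is a small static check over the import statements of core modules. The sketch below is an assumption about how that might look, not something the PR or TorchTitan provides; `find_forbidden_imports` is a hypothetical helper built only on the standard-library `ast` module.

```python
import ast


def find_forbidden_imports(
    source: str, forbidden_prefix: str = "torchtitan.experiments"
) -> list[str]:
    """Return absolute module names imported by `source` under forbidden_prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 restricts this to absolute imports.
            names = [node.module]
        else:
            continue
        found.extend(
            name
            for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return found


# The import flagged in the review, followed by an unrelated stdlib import.
sample = (
    "from torchtitan.experiments.ezpz.blendcorpus.blendcorpus_builder import (\n"
    "    BlendCorpusDataLoader,\n"
    ")\n"
    "import json\n"
)
violations = find_forbidden_imports(sample)
print(violations)
```

Because the source is parsed rather than imported, the check works without the referenced modules being installed.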

cd "${PBS_O_WORKDIR}"
fi

REPO_ROOT="/flare/AuroraGPT/sww/tt_aurora/torchtitan"

Copilot AI Mar 11, 2026


This submission script hard-codes a site/user-specific REPO_ROOT under /flare/..., which will break for other users/environments. Consider deriving the repo root from PBS_O_WORKDIR (if set) or allowing REPO_ROOT to be overridden via qsub -v REPO_ROOT=... with a sensible default.

Suggested change
REPO_ROOT="/flare/AuroraGPT/sww/tt_aurora/torchtitan"
REPO_ROOT="${REPO_ROOT:-${PBS_O_WORKDIR:-/flare/AuroraGPT/sww/tt_aurora/torchtitan}}"
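The precedence expressed by the nested `${VAR:-default}` expansion (an explicit `REPO_ROOT` first, then `PBS_O_WORKDIR`, then the hard-coded path) can be illustrated with an equivalent Python sketch. `resolve_repo_root` is a hypothetical helper written only to make the fallback order explicit.

```python
import os
from typing import Mapping

# The site-specific default from the submission script.
DEFAULT_REPO_ROOT = "/flare/AuroraGPT/sww/tt_aurora/torchtitan"


def resolve_repo_root(env: Mapping[str, str] = os.environ) -> str:
    """Pick REPO_ROOT, else PBS_O_WORKDIR, else the built-in default.

    `or` skips both missing and empty values, matching the shell's ":-"
    operator, which also treats an empty variable as unset.
    """
    return env.get("REPO_ROOT") or env.get("PBS_O_WORKDIR") or DEFAULT_REPO_ROOT


print(resolve_repo_root({"PBS_O_WORKDIR": "/work/dir"}))
```

Note that falling back to `PBS_O_WORKDIR` only yields the repository root when the job is submitted from that directory.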

Comment thread configs/ezpz_test.json
"training": {
"local_batch_size": 1,
"global_batch_size": 96,
"seq_len": 8096,

Copilot AI Mar 11, 2026


seq_len is set to 8096 here, but 8192 is the common power-of-two sequence length used throughout the repo/configs. If this is meant to be 8192, update the value; if 8096 is intentional, consider adding a short note explaining why this nonstandard length is required.

Suggested change
"seq_len": 8096,
"seq_len": 8192,

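A quick way to confirm that 8096 is not a power of two, and that 8192 is the next one up, is the standard bit trick sketched below. The helper names are illustrative only.

```python
def is_power_of_two(n: int) -> bool:
    """A positive integer is a power of two iff it has exactly one set bit."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


print(is_power_of_two(8096))    # False
print(is_power_of_two(8192))    # True
print(next_power_of_two(8096))  # 8192
```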
# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
# so temporarily disable nounset during environment initialization.
set +u
source <(curl -fsSL https://bit.ly/ezpz-utils)

Copilot AI Mar 11, 2026


The launcher sources and executes a remote script via source <(curl ...), which is a supply-chain risk and makes runs non-reproducible if the remote content changes. Prefer vendoring the needed ezpz helper script into the repo (or pinning to an immutable, checksum-verified artifact) and sourcing that local copy instead.

Suggested change
source <(curl -fsSL https://bit.ly/ezpz-utils)
EZPZ_UTILS_PATH="${SCRIPT_DIR}/../ezpz_utils.sh"
if [[ ! -r "${EZPZ_UTILS_PATH}" ]]; then
  echo "Missing ezpz utilities script at ${EZPZ_UTILS_PATH}."
  echo "Please vendor the ezpz utils script into the repository and try again."
  exit 1
fi
# Source vendored ezpz utilities instead of executing remote code via curl.
# This avoids supply-chain risk and makes runs reproducible.
# shellcheck disable=SC1090
source "${EZPZ_UTILS_PATH}"
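The alternative mentioned in the comment, pinning to a checksum-verified artifact, could look roughly like the sketch below. The function names and the temporary file are hypothetical. In practice the expected digest would be committed to the repository next to the launcher rather than computed at runtime as it is here for demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned_script(path: str, expected_sha256: str) -> None:
    """Raise if the script on disk does not match the pinned digest."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )


with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "ezpz_utils.sh")  # hypothetical vendored copy
    content = b"echo 'hypothetical helper'\n"
    with open(script_path, "wb") as f:
        f.write(content)

    # Demo only: a real pin would be a constant checked into version control.
    pinned = hashlib.sha256(content).hexdigest()
    verify_pinned_script(script_path, pinned)  # matches, so no exception

    # Simulate the content changing after the pin was recorded.
    with open(script_path, "ab") as f:
        f.write(b"echo tampered\n")
    try:
        verify_pinned_script(script_path, pinned)
        tamper_detected = False
    except RuntimeError:
        tamper_detected = True

print(tamper_detected)
```

A launcher would run such a check before sourcing the script, so that any change to the helper has to go through a reviewed update of the pinned digest.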

@saforem2 saforem2 mentioned this pull request Jun 11, 2026
saforem2 added a commit to nscottnichols/torchtitan that referenced this pull request Jun 11, 2026
`torchtitan/experiments/ezpz/moe/moe.py` was byte-identical to
`torchtitan/models/common/moe.py` at the time the PR was authored.
After the rebase onto current ezpz, the upstream file has moved on
(via the PR pytorch#3447 replay) and the ezpz fork now actively LAGS
upstream — it's missing the CP-friendly 3-D experts output:

    diff torchtitan/experiments/ezpz/moe/moe.py torchtitan/models/common/moe.py
    ...
    < return self.token_dispatcher.combine(routed_output_RD, metadata, x_TD)
    ---
    > out_TD = self.token_dispatcher.combine(routed_output_RD, metadata, x_TD)
    > # Un-flatten back to 3-D (B, *, D) so the local_map output sharding
    > # won't cause _StridedShard in the downstream view (e.g., CP is used).
    > return out_TD.view(B, -1, D)

Delete the local copy entirely and re-import the three names
(MoE, TokenChoiceTopKRouter, GroupedExperts) directly from upstream
in moe/__init__.py and moe/experts.py. This:
- gets the CP fix from upstream PR pytorch#3447 for free
- removes 466 lines of duplicated maintenance burden
- eliminates the silent-skew trap (every upstream MoE/router fix now
  flows in automatically rather than requiring manual replay)
- keeps EzpzGroupedExperts (a true subclass with the for_loop /
  grouped_mm switch) intact

Verified post-deletion that `EzpzGroupedExperts.__bases__` resolves
to the upstream `torchtitan.models.common.moe.GroupedExperts`, and
that `from torchtitan.experiments.ezpz.moe import MoE` resolves to
the upstream class.

Addresses review finding saforem2#8 on saforem2#14 — which became
more pressing post-rebase because the fork started actively lagging.