Add ezpz MoE configurations and JSON override #8
Reviewer's Guide

Adds ezpz-friendly DeepSeek V3 MoE run configs and launcher scripts, introduces a JSON-based override mechanism for TorchTitan configs, improves DeepSeek V3 model/parallelization behaviors (including selective FFN-only compile and vocab inference), and makes FSDP gradient-division handling more robust across backends and PyTorch versions.

Sequence diagram for JSON-based config override and DeepSeek V3 MoE launch

sequenceDiagram
actor User
participant Launcher as launch_deepseek_v3_moe_ep12_sh
participant Ezpz as ezpz_utils_env
participant TorchTitan as torchtitan_train_cli
participant ConfigReg as deepseek_v3_config_registry
participant JsonFile as tt_config_json_file
participant Trainer as Trainer
participant Model as DeepSeekV3Model
User->>Launcher: invoke with wandb_name and extra_args
Launcher->>Launcher: set TT_CONFIG_JSON default if unset
Launcher->>Launcher: resolve HF_ASSETS_PATH
Launcher->>Ezpz: ezpz_setup_env
Ezpz-->>Launcher: environment configured
Launcher->>TorchTitan: python -m torchtitan.experiments.ezpz.train activate deepseek_v3_10b_2b_ep12_from_json
TorchTitan->>ConfigReg: deepseek_v3_10b_2b_ep12_from_json()
ConfigReg->>ConfigReg: deepseek_v3_10b_2b_ep12()
ConfigReg-->>TorchTitan: base Trainer_Config
TorchTitan->>ConfigReg: _load_json_overrides()
ConfigReg->>ConfigReg: read TT_CONFIG_JSON_ENV
ConfigReg->>JsonFile: open and parse JSON
JsonFile-->>ConfigReg: overrides dict
ConfigReg-->>TorchTitan: overrides dict
TorchTitan->>ConfigReg: _apply_config_overrides(cfg, overrides)
ConfigReg->>ConfigReg: recursively apply overrides to Trainer_Config
ConfigReg-->>TorchTitan: mutated Trainer_Config
TorchTitan->>Trainer: __init__(config)
Trainer->>Trainer: build tokenizer using HF_ASSETS_PATH
Trainer->>Model: update_from_config(trainer_config)
Model->>Model: _infer_vocab_size_from_tokenizer_assets(hf_assets_path)
Model->>Model: if vocab_size differs, override model.vocab_size
Model-->>Trainer: configured model
Trainer->>Trainer: parallelize_deepseekv3 (activation_checkpoint, compile, fsdp, moe)
Trainer-->>User: training run active with JSON overrides
Updated class diagram for Trainer config, DeepSeek V3 model, and JSON override helpers

classDiagram
class Trainer {
+Config config
+__init__(config)
}
class Trainer_Config {
+TrainingConfig training
+ParallelismConfig parallelism
+ActivationCheckpointConfig activation_checkpoint
+CompileConfig compile
+MetricsProcessor_Config metrics
+OptimizersContainer_Config optimizer
+LRSchedulersContainer_Config lr_scheduler
+CheckpointManager_Config checkpoint
+str hf_assets_path
+Any dataloader
+Any debug
}
class TrainingConfig {
+int local_batch_size
+int global_batch_size
+int seq_len
+int steps
}
class ParallelismConfig {
+int data_parallel_replicate_degree
+int data_parallel_shard_degree
+int tensor_parallel_degree
+int pipeline_parallel_degree
+int context_parallel_degree
+int expert_parallel_degree
+int expert_tensor_parallel_degree
+str pipeline_parallel_schedule
}
class CompileConfig {
+bool enable
+str backend
+list~str~ components
}
class DeepSeekV3Model {
+int vocab_size
+Any layers
+Any rope
+update_from_config(trainer_config)
}
class DeepSeekV3Model_helpers {
+_infer_vocab_size_from_tokenizer_assets(hf_assets_path) int_or_None
+apply_compile_feed_forward_only(model, compile_config) void
}
class DeepSeekV3_ConfigRegistry {
+deepseek_v3_10b_2b_ep12() Trainer_Config
+deepseek_v3_10b_2b_ep12_from_json() Trainer_Config
+_deepseek_v3_10b_2b_model_spec() Any
+_load_json_overrides() dict
+_apply_config_overrides(target, overrides, path) void
}
class Ezpz_AGPT_ConfigRegistry {
+_base_config(flavor) FaultTolerantTrainer_Config
+_config_from_json(flavor) FaultTolerantTrainer_Config
+ezpz_agpt_debugmodel_from_json() FaultTolerantTrainer_Config
+ezpz_agpt_2b_from_json() FaultTolerantTrainer_Config
+ezpz_agpt_7b_from_json() FaultTolerantTrainer_Config
+ezpz_agpt_8b_from_json() FaultTolerantTrainer_Config
+_load_json_overrides() dict
+_apply_config_overrides(target, overrides, path) void
}
class FaultTolerantTrainer_Config {
+Any training
+Any parallelism
+Any dataloader
+Any optimizer
+Any checkpoint
+Any compile
}
Trainer --> Trainer_Config : uses
Trainer_Config --> TrainingConfig : has
Trainer_Config --> ParallelismConfig : has
Trainer_Config --> CompileConfig : has
Trainer --> DeepSeekV3Model : builds
DeepSeekV3Model ..> DeepSeekV3Model_helpers : calls
DeepSeekV3_ConfigRegistry --> Trainer_Config : creates
Ezpz_AGPT_ConfigRegistry --> FaultTolerantTrainer_Config : creates
DeepSeekV3_ConfigRegistry ..> DeepSeekV3Model : configures model_spec
Hey - I've found 4 issues, and left some high level feedback:
- The JSON override helpers (`_load_json_overrides` / `_apply_config_overrides`) are duplicated between `deepseek_v3` and `ezpz/agpt` config registries; consider factoring these into a shared utility to avoid divergence and keep the error semantics consistent.
- The tokenizer-based vocab size inference in `_infer_vocab_size_from_tokenizer_assets` assumes a specific `tokenizer.json` structure and successful JSON parse; adding a small try/except and a more defensive fallback path would make this more robust to HF tokenizer format changes or partial assets.
- The FSDP gradient-division logic (backend detection and `set_*` method checks) is now very similar between `llama3.parallelize` and `ezpz/agpt.parallelize`; pulling this into a shared helper would reduce duplication and help keep behavior aligned across models as it evolves.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The JSON override helpers (`_load_json_overrides` / `_apply_config_overrides`) are duplicated between `deepseek_v3` and `ezpz/agpt` config registries; consider factoring these into a shared utility to avoid divergence and keep the error semantics consistent.
- The tokenizer-based vocab size inference in `_infer_vocab_size_from_tokenizer_assets` assumes a specific `tokenizer.json` structure and successful JSON parse; adding a small try/except and a more defensive fallback path would make this more robust to HF tokenizer format changes or partial assets.
- The FSDP gradient-division logic (backend detection and `set_*` method checks) is now very similar between `llama3.parallelize` and `ezpz/agpt.parallelize`; pulling this into a shared helper would reduce duplication and help keep behavior aligned across models as it evolves.
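
As a rough illustration of the shared helper suggested in the last point, the sketch below tries a list of setter names in order and reports which one was called. The function name `disable_fsdp_gradient_division`, the candidate method names, and the stand-in classes are assumptions for illustration, not code from this PR; the setters actually available depend on the installed PyTorch version.

```python
from typing import Any, Optional, Sequence

# Hypothetical candidate setter names (assumption): verify against your PyTorch version.
CANDIDATE_SETTERS: Sequence[str] = (
    "set_gradient_divide_factor",
    "set_reduce_scatter_divide_factor",
)


def disable_fsdp_gradient_division(
    module: Any,
    factor: float = 1.0,
    candidates: Sequence[str] = CANDIDATE_SETTERS,
) -> Optional[str]:
    """Call the first setter that exists on `module`; return its name, or None."""
    for name in candidates:
        setter = getattr(module, name, None)
        if callable(setter):
            setter(factor)
            return name
    return None


# Stand-ins for FSDP-wrapped modules from two different PyTorch versions.
class NewStyleModule:
    def set_gradient_divide_factor(self, factor: float) -> None:
        self.factor = factor


class OldStyleModule:
    def set_reduce_scatter_divide_factor(self, factor: float) -> None:
        self.factor = factor


new_mod, old_mod = NewStyleModule(), OldStyleModule()
used_new = disable_fsdp_gradient_division(new_mod)
used_old = disable_fsdp_gradient_division(old_mod)
used_none = disable_fsdp_gradient_division(object())  # no setter available
```

Returning the name of the setter that was used (or `None`) lets each `parallelize` module decide whether to log a warning or raise, while the probing order lives in one place.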
## Individual Comments
### Comment 1
<location path="torchtitan/experiments/ezpz/agpt/config_registry.py" line_range="57-66" />
<code_context>
+def _load_json_overrides() -> dict[str, Any]:
</code_context>
<issue_to_address>
**suggestion:** The JSON override helpers are duplicated across modules and could be centralized to avoid divergence.
Since this logic is now duplicated with `torchtitan/models/deepseek_v3/config_registry.py`, please extract these helpers into a shared utility (e.g., a `config_overrides` module) and reuse them from both locations to keep future behavior changes consistent.
Suggested implementation:
```python
from torchtitan.config_overrides import load_json_overrides as _load_json_overrides
```
To fully implement the centralization you requested, you will also need to:
1. Create a shared helper module, e.g. `torchtitan/config_overrides.py`, with something like:
```python
import json
import os
from typing import Any, Dict

TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"


def load_json_overrides(tt_config_env: str = TT_CONFIG_JSON_ENV) -> Dict[str, Any]:
    path = os.environ.get(tt_config_env, "").strip()
    if not path:
        raise ValueError(
            f"{tt_config_env} must point to a JSON file when using *_from_json configs."
        )
    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)
    if not isinstance(overrides, dict):
        raise TypeError(
            f"JSON overrides loaded from {path!r} must be an object at the top level, "
            f"got {type(overrides).__name__}"
        )
    return overrides
```
2. Update `torchtitan/models/deepseek_v3/config_registry.py` to remove its local JSON override helper and reuse the shared one, mirroring how this file now imports it:
```python
from torchtitan.config_overrides import load_json_overrides as _load_json_overrides
```
and delete the duplicated helper implementation there.
3. Ensure that any references in both modules still use `_load_json_overrides()` as before so behavior remains unchanged, while the implementation is now shared and future changes only need to be made in `torchtitan/config_overrides.py`.
</issue_to_address>
### Comment 2
<location path="torchtitan/experiments/ezpz/moe_runs/launch_deepseek_v3_moe_ep12.sh" line_range="20" />
<code_context>
+# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
+# so temporarily disable nounset during environment initialization.
+set +u
+source <(curl -fsSL https://bit.ly/ezpz-utils)
+ezpz_setup_env
+set -u
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Sourcing a remote script on every launch introduces reliability and security risks.
Using `curl -fsSL https://bit.ly/ezpz-utils` at runtime makes each job depend on an unversioned remote script behind a shortlink. Any outage, redirect, or upstream change could break runs or silently alter behavior. Prefer vendoring a copy into the repo or using a stable, versioned URL to improve reproducibility and limit the impact of upstream changes.
Suggested implementation:
```
# Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT),
# so temporarily disable nounset during environment initialization.
set +u
# Use a vendored ezpz-utils script to avoid runtime dependency on a remote shortlink.
EZPZ_UTILS="${REPO_ROOT}/torchtitan/experiments/ezpz/ezpz-utils.sh"
if [[ ! -r "${EZPZ_UTILS}" ]]; then
  echo "ERROR: ezpz utils script not found at: ${EZPZ_UTILS}" >&2
  echo "Please ensure a stable, versioned copy of ezpz-utils.sh is checked into the repo." >&2
  exit 1
fi
# shellcheck source=/dev/null
source "${EZPZ_UTILS}"
ezpz_setup_env
set -u
```
1. Add a vendored copy of the `ezpz-utils` script at:
`torchtitan/experiments/ezpz/ezpz-utils.sh` (this should be a stable, versioned copy from upstream).
2. Keep this file updated via your normal dependency/version management process, rather than sourcing directly from a remote URL at runtime.
</issue_to_address>
### Comment 3
<location path="torchtitan/experiments/ezpz/moe_runs/README.md" line_range="30" />
<code_context>
+- `steps=40`
+- `attn_backend=sdpa`
+- `compile.enable=false`
+- checkpoint save every 5 steps (`enable_first_step_checkpoint=true`)
+
+## Run
</code_context>
<issue_to_address>
**nitpick (typo):** Consider fixing the grammar in this bullet point.
For example: "checkpoint saved every 5 steps" or "checkpoints saved every 5 steps".
```suggestion
- checkpoint saved every 5 steps (`enable_first_step_checkpoint=true`)
```
</issue_to_address>
### Comment 4
<location path="torchtitan/models/deepseek_v3/config_registry.py" line_range="36" />
<code_context>
from . import model_registry
+TT_CONFIG_JSON_ENV = "TT_CONFIG_JSON"
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the JSON config override logic into a shared helper module and having registry files only call those helpers to keep them declarative and DRY.
You can reduce complexity and duplication by extracting the JSON override logic into a shared helper module and keeping this registry file declarative.
### 1. Move override helpers into a shared module
Create e.g. `torchtitan/config/overrides.py`:
```python
# torchtitan/config/overrides.py
import json
import os
from dataclasses import is_dataclass
from typing import Any


def load_overrides_from_env(env_var: str = "TT_CONFIG_JSON") -> dict[str, Any]:
    path = os.environ.get(env_var, "").strip()
    if not path:
        raise ValueError(
            f"{env_var} must point to a JSON file when using *_from_json configs."
        )
    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)
    if not isinstance(overrides, dict):
        raise ValueError(
            f"Expected top-level JSON object in {path!r}, "
            f"got {type(overrides).__name__}."
        )
    return overrides


def apply_overrides(target: Any, overrides: dict[str, Any], path: str = "") -> None:
    for key, value in overrides.items():
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {key!r} at path {path or '<root>'}.")
        current_value = getattr(target, key)
        field_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            # Nested JSON objects are only valid for nested dataclass fields.
            if not is_dataclass(current_value):
                raise TypeError(
                    f"Expected dataclass at {field_path!r} for nested override, "
                    f"got {type(current_value).__name__}."
                )
            apply_overrides(current_value, value, field_path)
            continue
        setattr(target, key, value)
```
This keeps all existing behavior (env var name, validation, recursive dataclass handling), but in one place.
### 2. Simplify this registry file to call shared helpers
In the registry file, remove `_load_json_overrides`, `_apply_config_overrides`, and `TT_CONFIG_JSON_ENV`, and import the shared helpers instead:
```python
# in this registry file
from torchtitan.config.overrides import load_overrides_from_env, apply_overrides


def deepseek_v3_10b_2b_ep12_from_json() -> Trainer.Config:
    cfg = deepseek_v3_10b_2b_ep12()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg
```
### 3. Update the other registry using the same helpers
In `experiments/ezpz/agpt/config_registry.py` (where the logic is duplicated), replace the local copies with the same import and usage pattern:
```python
from torchtitan.config.overrides import load_overrides_from_env, apply_overrides


def some_config_from_json() -> Trainer.Config:
    cfg = some_config()
    overrides = load_overrides_from_env("TT_CONFIG_JSON")
    apply_overrides(cfg, overrides)
    return cfg
```
This removes the recursive override engine from both registry files, avoids copy-paste divergence, and makes future changes to override semantics centralized and easier to maintain, while preserving all current functionality.
</issue_to_address>
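
To make the override semantics discussed above concrete, here is a minimal, self-contained sketch of the recursive apply step run against toy dataclasses. The `TrainingConfig`, `CompileConfig`, and `TrainerConfig` classes below are simplified stand-ins (assumptions), not the real TorchTitan config types, and the JSON file is written to a temporary directory purely for demonstration.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, is_dataclass
from typing import Any


# Simplified stand-ins for the real config dataclasses (assumption).
@dataclass
class TrainingConfig:
    local_batch_size: int = 1
    seq_len: int = 4096
    steps: int = 10


@dataclass
class CompileConfig:
    enable: bool = False


@dataclass
class TrainerConfig:
    training: TrainingConfig = field(default_factory=TrainingConfig)
    compile: CompileConfig = field(default_factory=CompileConfig)


def apply_overrides(target: Any, overrides: dict, path: str = "") -> None:
    """Recursively set fields on `target`; reject unknown keys and bad nesting."""
    for key, value in overrides.items():
        if not hasattr(target, key):
            raise KeyError(f"Unknown config field {key!r} at path {path or '<root>'}.")
        current = getattr(target, key)
        field_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            if not is_dataclass(current):
                raise TypeError(f"Expected dataclass at {field_path!r}.")
            apply_overrides(current, value, field_path)
        else:
            setattr(target, key, value)


# Write a small override file, then read it back the way a loader would.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "overrides.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"training": {"steps": 40}, "compile": {"enable": True}}, f)
    with open(json_path, encoding="utf-8") as f:
        overrides = json.load(f)

cfg = TrainerConfig()
apply_overrides(cfg, overrides)

# A misspelled key fails loudly instead of being silently ignored.
try:
    apply_overrides(cfg, {"training": {"stepz": 1}})
    unknown_field_error = None
except KeyError as exc:
    unknown_field_error = str(exc)
```

Only the fields named in the JSON change; everything else keeps the base config's value, which is what makes small per-run preset files practical.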
Pull request overview
This PR adds an “ezpz” launch/config framework for running DeepSeek V3-style MoE training (including multiple JSON run presets), introduces JSON-based config overlays via TT_CONFIG_JSON, and improves robustness of FSDP gradient-division handling across PyTorch versions.
Changes:
- Add DeepSeek V3 MoE run presets + launcher/submission scripts under `torchtitan/experiments/ezpz/moe_runs/`.
- Add `*_from_json` config entrypoints that load and apply JSON overrides from `TT_CONFIG_JSON`.
- Improve PyTorch-version resilience for disabling FSDP gradient division and add a DeepSeekV3 “compile feed-forward only” path.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| torchtitan/trainer.py | Pass additional runtime kwargs (e.g., global batch size / steps / parallel dims) into dataloader construction. |
| torchtitan/models/llama3/parallelize.py | Make FSDP gradient-division disabling resilient to PyTorch internal class location changes. |
| torchtitan/models/deepseek_v3/parallelize.py | Add optional feed_forward-only compilation path for DeepSeek V3. |
| torchtitan/models/deepseek_v3/model.py | Infer vocab size from tokenizer assets and override model vocab_size accordingly. |
| torchtitan/models/deepseek_v3/config_registry.py | Add JSON override overlay utilities + a new DeepSeek V3 MoE config (10b_2b_ep12) and *_from_json entrypoint. |
| torchtitan/experiments/ezpz/moe_runs/submit_deepseek_v3_moe_ep12_128n_6h.pbs | Add a PBS submission template for 128-node runs. |
| torchtitan/experiments/ezpz/moe_runs/launch_deepseek_v3_moe_ep12.sh | Add a launcher that wires W&B + TT_CONFIG_JSON + assets discovery, then launches training. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_smoke.json | Add smoke-test JSON overrides for a small DeepSeek V3 MoE run. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb3.json | Add a 2-node 4096-seq config preset (LB=3) with selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_lb2_compile_ffn.json | Add a 2-node 4096-seq preset that enables feed_forward compile + loss compile. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_layer1_lb3.json | Add a 2-node 4096-seq preset using layer-frequency selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac_full_lb3.json | Add a 2-node 4096-seq preset using full AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim_ac.json | Add a 2-node 4096-seq preset with selective AC. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_prod_sim.json | Add a 2-node 4096-seq preset with AC disabled. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes_4096_perf.json | Add a 2-node 4096-seq throughput/perf-oriented preset. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_2nodes.json | Add a baseline 2-node JSON overrides preset. |
| torchtitan/experiments/ezpz/moe_runs/deepseek_v3_10b2b_ep12_128nodes_4096_prod_sim_ac_lb2_compile_ffn.json | Add a large-scale (128-node) JSON overrides preset. |
| torchtitan/experiments/ezpz/moe_runs/README.md | Document how to use the new MoE run launcher and presets. |
| torchtitan/experiments/ezpz/agpt/parallelize.py | Apply the same FSDP gradient-division robustness pattern in the ezpz AGPT parallelization path. |
| torchtitan/experiments/ezpz/agpt/config_registry.py | Add *_from_json config entrypoints for ezpz AGPT via TT_CONFIG_JSON. |
| configs/ezpz_test.json | Add an example JSON override file for testing the override mechanism. |
    if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" && ! -f "${HF_ASSETS_PATH}/tokenizer.model" ]]; then
      echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain tokenizer.json or tokenizer.model"
      echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory."

The HF assets validation allows a directory with only `tokenizer.model`, but the default TorchTitan `HuggingFaceTokenizer` loader does not support SentencePiece `.model` files (it expects `tokenizer.json` or vocab/merges files). This check can therefore pass and still fail later when building the tokenizer; consider requiring `tokenizer.json` (or vocab/merges) here, or switching the run config to use a tokenizer backend that supports `tokenizer.model`.

Suggested change:

    -if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" && ! -f "${HF_ASSETS_PATH}/tokenizer.model" ]]; then
    -  echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain tokenizer.json or tokenizer.model"
    -  echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory."
    +if [[ ! -f "${HF_ASSETS_PATH}/tokenizer.json" ]]; then
    +  echo "HF_ASSETS_PATH=${HF_ASSETS_PATH} does not contain the required tokenizer.json file."
    +  echo "Please point HF_ASSETS_PATH at a valid tokenizer assets directory containing tokenizer.json."
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    if not os.path.exists(tokenizer_json_path):
        return None

    with open(tokenizer_json_path, encoding="utf-8") as f:
        tokenizer_data = json.load(f)

    vocab = tokenizer_data.get("model", {}).get("vocab", {})
    added_tokens = tokenizer_data.get("added_tokens", [])

    max_token_id = -1
    if isinstance(vocab, dict):
        for token_id in vocab.values():
            if isinstance(token_id, int):
                max_token_id = max(max_token_id, token_id)

    if isinstance(added_tokens, list):
        for token_info in added_tokens:
            if isinstance(token_info, dict):
                token_id = token_info.get("id")
                if isinstance(token_id, int):
                    max_token_id = max(max_token_id, token_id)

    if max_token_id < 0:
        return None
    return max_token_id + 1

`_infer_vocab_size_from_tokenizer_assets` only reads `tokenizer.json`, but TorchTitan tokenizers can also be loaded from `vocab.json`/`vocab.txt` (+ `merges.txt`). If a valid assets dir lacks `tokenizer.json`, vocab inference will silently return `None` and `vocab_size` will stay at the default, potentially causing a shape mismatch with the tokenizer. Consider reusing the same tokenizer-loading logic as `HuggingFaceTokenizer` (or otherwise supporting vocab/merges formats) when inferring vocab size.

Suggested change (replaces the lines above):

    """
    Try to infer the tokenizer vocabulary size from assets stored in ``hf_assets_path``.
    Preference order:
    1. ``tokenizer.json`` (uses max token id + 1, including added tokens)
    2. ``vocab.json`` (length of vocab)
    3. ``vocab.txt`` (number of non-empty, non-comment lines)
    """
    # First try tokenizer.json, which may contain explicit token ids.
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    if os.path.exists(tokenizer_json_path):
        with open(tokenizer_json_path, encoding="utf-8") as f:
            tokenizer_data = json.load(f)
        vocab = tokenizer_data.get("model", {}).get("vocab", {})
        added_tokens = tokenizer_data.get("added_tokens", [])
        max_token_id = -1
        if isinstance(vocab, dict):
            for token_id in vocab.values():
                if isinstance(token_id, int):
                    max_token_id = max(max_token_id, token_id)
        if isinstance(added_tokens, list):
            for token_info in added_tokens:
                if isinstance(token_info, dict):
                    token_id = token_info.get("id")
                    if isinstance(token_id, int):
                        max_token_id = max(max_token_id, token_id)
        if max_token_id >= 0:
            return max_token_id + 1
    # Fallback: try vocab.json, which typically contains a mapping from token to id
    # or a list of tokens. In both cases, the length represents the vocab size.
    vocab_json_path = os.path.join(hf_assets_path, "vocab.json")
    if os.path.exists(vocab_json_path):
        try:
            with open(vocab_json_path, encoding="utf-8") as f:
                vocab_data = json.load(f)
        except Exception:
            vocab_data = None
        if isinstance(vocab_data, dict):
            if len(vocab_data) > 0:
                return len(vocab_data)
        elif isinstance(vocab_data, list):
            if len(vocab_data) > 0:
                return len(vocab_data)
    # Fallback: try vocab.txt (one token per line, possibly with comments).
    vocab_txt_path = os.path.join(hf_assets_path, "vocab.txt")
    if os.path.exists(vocab_txt_path):
        vocab_size = 0
        try:
            with open(vocab_txt_path, encoding="utf-8") as f:
                for line in f:
                    stripped = line.strip()
                    # Skip empty lines and comment lines (commonly start with '#').
                    if not stripped or stripped.startswith("#"):
                        continue
                    vocab_size += 1
        except Exception:
            vocab_size = 0
        if vocab_size > 0:
            return vocab_size
    # If no known assets are found or none yield a valid size, return None.
    return None
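
For illustration, the `tokenizer.json` branch can also be hardened against unreadable or malformed files, as the high-level feedback earlier in this thread suggests. The sketch below is a hypothetical standalone variant (`infer_vocab_size` is an assumed name, not the PR's function) exercised against temporary files; it covers only the `tokenizer.json` case.

```python
import json
import os
import tempfile
from typing import Optional


def infer_vocab_size(hf_assets_path: str) -> Optional[int]:
    """Return max token id + 1 from tokenizer.json, or None if it cannot be read."""
    tokenizer_json_path = os.path.join(hf_assets_path, "tokenizer.json")
    try:
        with open(tokenizer_json_path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing file, unreadable file, bad encoding, or invalid JSON.
        return None
    if not isinstance(data, dict):
        return None

    model = data.get("model")
    vocab = model.get("vocab") if isinstance(model, dict) else None
    added_tokens = data.get("added_tokens")

    token_ids = []
    if isinstance(vocab, dict):
        token_ids += [i for i in vocab.values() if isinstance(i, int)]
    if isinstance(added_tokens, list):
        token_ids += [
            t["id"]
            for t in added_tokens
            if isinstance(t, dict) and isinstance(t.get("id"), int)
        ]
    return max(token_ids) + 1 if token_ids else None


with tempfile.TemporaryDirectory() as good_dir, \
        tempfile.TemporaryDirectory() as broken_dir, \
        tempfile.TemporaryDirectory() as empty_dir:
    # Well-formed assets: base vocab ids 0..2 plus an added token with id 5.
    with open(os.path.join(good_dir, "tokenizer.json"), "w", encoding="utf-8") as f:
        json.dump(
            {
                "model": {"vocab": {"a": 0, "b": 1, "c": 2}},
                "added_tokens": [{"id": 5, "content": "<pad>"}],
            },
            f,
        )
    # Truncated file that is not valid JSON.
    with open(os.path.join(broken_dir, "tokenizer.json"), "w", encoding="utf-8") as f:
        f.write('{"model": {"vocab": ')

    size_ok = infer_vocab_size(good_dir)
    size_broken = infer_vocab_size(broken_dir)
    size_missing = infer_vocab_size(empty_dir)
```

Whether a parse failure should return `None` or raise is a design choice; returning `None` keeps the configured default `vocab_size`, which carries the shape-mismatch risk described in the comment above, so a warning log at the call site may be worth adding.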
    from torchtitan.experiments.ezpz.blendcorpus.blendcorpus_builder import (
        BlendCorpusDataLoader,
    )

This model-level config registry now imports `BlendCorpusDataLoader` from `torchtitan.experiments.*`, creating a dependency from core model code into the experiments tree. Elsewhere under `torchtitan/models/*/config_registry.py` there are no `torchtitan.experiments` imports; consider moving this ezpz/BlendCorpus-specific config into an experiments config registry (or keep the model config registry using core dataloaders only) to avoid packaging/layering issues.
| cd "${PBS_O_WORKDIR}" | ||
| fi | ||
|
|
||
| REPO_ROOT="/flare/AuroraGPT/sww/tt_aurora/torchtitan" |
This submission script hard-codes a site/user-specific REPO_ROOT under /flare/..., which will break for other users/environments. Consider deriving the repo root from PBS_O_WORKDIR (if set) or allowing REPO_ROOT to be overridden via qsub -v REPO_ROOT=... with a sensible default.
| REPO_ROOT="/flare/AuroraGPT/sww/tt_aurora/torchtitan" | |
| REPO_ROOT="${REPO_ROOT:-${PBS_O_WORKDIR:-/flare/AuroraGPT/sww/tt_aurora/torchtitan}}" |
| "training": { | ||
| "local_batch_size": 1, | ||
| "global_batch_size": 96, | ||
| "seq_len": 8096, |
seq_len is set to 8096 here, but 8192 is the common power-of-two sequence length used throughout the repo/configs. If this is meant to be 8192, update the value; if 8096 is intentional, consider adding a short note explaining why this nonstandard length is required.
| "seq_len": 8096, | |
| "seq_len": 8192, |
| # Lmod scripts used by ezpz env setup can reference unset vars (e.g. ZSH_EVAL_CONTEXT), | ||
| # so temporarily disable nounset during environment initialization. | ||
| set +u | ||
| source <(curl -fsSL https://bit.ly/ezpz-utils) |
The launcher sources and executes a remote script via source <(curl ...), which is a supply-chain risk and makes runs non-reproducible if the remote content changes. Prefer vendoring the needed ezpz helper script into the repo (or pinning to an immutable, checksum-verified artifact) and sourcing that local copy instead.
| source <(curl -fsSL https://bit.ly/ezpz-utils) | |
| EZPZ_UTILS_PATH="${SCRIPT_DIR}/../ezpz_utils.sh" | |
| if [[ ! -r "${EZPZ_UTILS_PATH}" ]]; then | |
| echo "Missing ezpz utilities script at ${EZPZ_UTILS_PATH}." | |
| echo "Please vendor the ezpz utils script into the repository and try again." | |
| exit 1 | |
| fi | |
| # Source vendored ezpz utilities instead of executing remote code via curl. | |
| # This avoids supply-chain risk and makes runs reproducible. | |
| # shellcheck disable=SC1090 | |
| source "${EZPZ_UTILS_PATH}" |
`torchtitan/experiments/ezpz/moe/moe.py` was byte-identical to `torchtitan/models/common/moe.py` at the time the PR was authored. After the rebase onto current ezpz, the upstream file has moved on (via the PR pytorch#3447 replay) and the ezpz fork now actively LAGS upstream: it is missing the CP-friendly 3-D experts output:

    diff torchtitan/experiments/ezpz/moe/moe.py torchtitan/models/common/moe.py
    ...
    < return self.token_dispatcher.combine(routed_output_RD, metadata, x_TD)
    ---
    > out_TD = self.token_dispatcher.combine(routed_output_RD, metadata, x_TD)
    > # Un-flatten back to 3-D (B, *, D) so the local_map output sharding
    > # won't cause _StridedShard in the downstream view (e.g., CP is used).
    > return out_TD.view(B, -1, D)

Delete the local copy entirely and re-import the three names (MoE, TokenChoiceTopKRouter, GroupedExperts) directly from upstream in moe/__init__.py and moe/experts.py. This:
- gets the CP fix from upstream PR pytorch#3447 for free
- removes 466 lines of duplicated maintenance burden
- eliminates the silent-skew trap (every upstream MoE/router fix now flows in automatically rather than requiring manual replay)
- keeps EzpzGroupedExperts (a true subclass with the for_loop / grouped_mm switch) intact

Verified post-deletion that `EzpzGroupedExperts.__bases__` resolves to the upstream `torchtitan.models.common.moe.GroupedExperts`, and that `from torchtitan.experiments.ezpz.moe import MoE` resolves to the upstream class.

Addresses review finding saforem2#8 on saforem2#14, which became more pressing post-rebase because the fork started actively lagging.
This pull request introduces a framework for launching DeepSeek V3-style Mixture-of-Experts (MoE) training runs, with a focus on flexible, large-scale configuration via JSON files and environment variables. It adds a new set of configuration files and documentation for various MoE training scenarios, and enhances the configuration registry to support overlaying JSON-based overrides onto base configs. Additionally, it improves robustness in the FSDP gradient division logic to better handle PyTorch version differences.
The most important changes are:
1. MoE Run Configuration and Documentation
- Added a new directory (moe_runs) containing multiple JSON configuration files for different DeepSeek V3 MoE training scenarios (e.g., 2 nodes, 128 nodes, various batch sizes, activation checkpointing strategies, and compilation options). This enables easy experimentation with different distributed and model configurations.
- Added documentation (moe_runs/README.md) explaining how to launch these MoE runs, customize configs, and use the new setup.
2. JSON-Based Config Overrides
- Updated config_registry.py to support loading configuration overrides from a JSON file specified via the TT_CONFIG_JSON environment variable, and overlaying these onto base configs. This allows dynamic, environment-driven configuration without code changes.
3. Improved FSDP Gradient Division Handling
- Updated the FSDP gradient division logic in parallelize.py to be more robust against PyTorch version changes by dynamically checking for the presence of relevant methods, rather than relying on specific class imports or locations. This increases compatibility and maintainability.
4. Example/Test Configs
- Added an example config (ezpz_test.json) to demonstrate or validate the new configuration override mechanism.

These changes collectively make it easier to configure, launch, and experiment with large-scale MoE training runs on different hardware setups, while improving the maintainability and flexibility of the codebase.
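The JSON overlay described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: only the `TT_CONFIG_JSON` variable name and the `training` keys come from the PR; the dataclasses, the function names `load_json_overrides` and `apply_overrides`, and the default values are assumptions made for the example.

```python
import json
import os
from dataclasses import dataclass, field, is_dataclass
from typing import Any


@dataclass
class TrainingConfig:  # stand-in for the real training section
    local_batch_size: int = 8
    global_batch_size: int = -1
    seq_len: int = 2048


@dataclass
class TrainerConfig:  # stand-in for the real trainer config
    training: TrainingConfig = field(default_factory=TrainingConfig)


def load_json_overrides(env_var: str = "TT_CONFIG_JSON") -> dict[str, Any]:
    """Return the overrides dict named by env_var, or {} if it is unset."""
    path = os.environ.get(env_var)
    if not path:
        return {}
    with open(path, encoding="utf-8") as f:
        overrides = json.load(f)
    if not isinstance(overrides, dict):
        raise TypeError(f"{path} must contain a JSON object at the top level")
    return overrides


def apply_overrides(cfg: Any, overrides: dict[str, Any]) -> Any:
    """Recursively overlay overrides onto cfg in place, rejecting unknown keys."""
    for key, value in overrides.items():
        if not hasattr(cfg, key):
            raise KeyError(f"unknown config field: {key!r}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)  # descend into a nested section
        else:
            setattr(cfg, key, value)  # leaf: replace the value
    return cfg
```

With `TT_CONFIG_JSON` pointing at a file containing `{"training": {"seq_len": 8192}}`, calling `apply_overrides(TrainerConfig(), load_json_overrides())` changes only `training.seq_len` and leaves the other defaults untouched. Rejecting unknown keys is a design choice in this sketch so that a typo in the JSON fails loudly instead of being silently ignored.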
Summary by Sourcery
Introduce ezpz-based DeepSeek V3 MoE run presets with JSON-driven configuration overrides and more robust runtime behaviors for compilation, vocab sizing, and FSDP gradient handling.
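For the FSDP gradient handling mentioned in this summary, the PR description says the code checks for the presence of methods instead of importing a specific class. A hedged sketch of that duck-typing pattern follows; the function name, the candidate method names, and the stand-in classes are assumptions for illustration and may not match what parallelize.py actually does.

```python
from typing import Any, Iterable


def disable_gradient_division(modules: Iterable[Any], factor: float = 1.0) -> int:
    """Call whichever divide-factor setter each module exposes.

    Returns the number of modules that were updated.
    """
    # Candidate setter names are assumptions for this sketch, tried in order.
    candidates = ("set_gradient_divide_factor", "set_reduce_scatter_divide_factor")
    updated = 0
    for module in modules:
        for name in candidates:
            setter = getattr(module, name, None)
            if callable(setter):
                setter(factor)
                updated += 1
                break  # use only the first matching setter per module
    return updated


class NewStyleModule:  # stand-in exposing the first candidate name
    def set_gradient_divide_factor(self, factor: float) -> None:
        self.factor = factor


class OldStyleModule:  # stand-in exposing the second candidate name
    def set_reduce_scatter_divide_factor(self, factor: float) -> None:
        self.factor = factor


class PlainModule:  # stand-in for a module that is not FSDP-wrapped
    pass
```

On a real model this would be called as `disable_gradient_division(model.modules())`. Modules without either setter are skipped, so the same code path works whether or not a given submodule is sharded.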
New Features:
Bug Fixes:
Enhancements:
Documentation:
Tests:
Chores: