
Add utility to reshard TFDS dataset for large scale training #3789

Merged
copybara-service[bot] merged 1 commit into main from eval_data_shard on May 1, 2026

Conversation

Collaborator

@RissyRan RissyRan commented May 1, 2026

Description

Add a utility to reshard TFDS dataset for large scale training, b/508024540.

  • Eliminates overhead of dynamic re-sharding during training by pre-sharding data to match the host count.
  • Tests show a reduction in training loss (from 4.075 to 3.582). This is because the TFDS fallback logic (ds.shard) performs a "logical shard" that effectively reconstructs the un-shuffled sequential order of the original dataset across the global batch. By physically re-sharding into dedicated files, each host can stream contiguous records into its local shuffle buffer. This ensures that when host data is combined, the global batch contains a truly randomized distribution of data categories, leading to higher-quality gradients and better model performance.
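
The contrast between the two code paths can be sketched as follows. This is a minimal illustration only: the GCS paths, file layout, and TFRecord format are hypothetical placeholders, not the utility's actual interface.

```python
import jax
import tensorflow as tf

num_hosts = jax.process_count()   # e.g. 2048 dataloading hosts
host_id = jax.process_index()

# Fallback ("logical shard"): every host opens the same small file list and
# keeps every num_hosts-th record, which preserves the original sequential
# order of the dataset across the global batch.
all_files = tf.io.gfile.glob("gs://my-bucket/c4/en/3.0.5/*.tfrecord*")  # hypothetical path
logical = tf.data.TFRecordDataset(all_files).shard(num_shards=num_hosts, index=host_id)

# After physical resharding: each host opens only its own dedicated files and
# streams contiguous records straight into its local shuffle buffer, so the
# combined global batch is well mixed across data categories.
my_files = tf.io.gfile.glob(f"gs://my-bucket/c4_resharded/*-{host_id:05d}-of-*")  # hypothetical layout
physical = tf.data.TFRecordDataset(my_files).shuffle(buffer_size=10_000)
```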

Tests

Local conversation - example log

Before this change:

  • Warnings when loading both the training and eval datasets, and a large performance hit on every training step
  • End-to-end training logs from loading the real dataset; performance is 80-90 tokens/s/device per step link
/deps/src/maxtext/input_pipeline/tfds_data_processing.py:64: UserWarning: WARNING: Inefficient dataloading. Your c4/en:3.0.5 contains 128shards, smaller than dataloading_host_count=2048. This is known to lead to inefficient dataloading.

/deps/src/maxtext/input_pipeline/tfds_data_processing.py:64: UserWarning: WARNING: Inefficient dataloading. Your c4/en:3.0.5 contains 64shards, smaller than dataloading_host_count=2048. This is known to lead to inefficient dataloading
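
For reference, the condition that triggers this warning (fewer physical shards than dataloading hosts) can be checked up front. A minimal sketch, assuming the dataset is already prepared in the default data_dir and that the host count equals `jax.process_count()` (in MaxText the actual value comes from the config as `dataloading_host_count`):

```python
import jax
import tensorflow_datasets as tfds

builder = tfds.builder("c4/en:3.0.5")                # assumes the dataset is available in data_dir
num_shards = builder.info.splits["train"].num_shards  # physical shard count of the split
host_count = jax.process_count()                      # stand-in for dataloading_host_count

if num_shards < host_count:
  print(f"{num_shards} shards < {host_count} hosts: pre-shard the data to avoid inefficient dataloading")
```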

After this change:

  • We reshard the dataset based on the host count
  • No warnings found in the logs: link
  • End-to-end training logs from loading the real dataset; performance is 600-700 tokens/s/device per step link

For training loss:

  • before the change, initial loss: 5.819
  • after the change, initial loss: 5.816
  • observing training efficiency after re-sharding: the llm_loss at step 77 drops from 4.075 (before resharding) to 3.582 (after resharding). Both train and validation losses are aligned.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@RissyRan RissyRan changed the title from "Add a utility to reshard TFDS dataset for large scale training" to "Add utility to reshard TFDS dataset for large scale training" on May 1, 2026

github-actions Bot commented May 1, 2026

🤖 Hi @shuningjin, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This Pull Request introduces a valuable utility for resharding TFDS datasets, which is critical for maintaining efficient dataloading performance in large-scale training environments (e.g., matching shard count to host count). The implementation is well-structured, leverages multiprocessing effectively, and correctly updates the dataset metadata.

🔍 General Feedback

  • The use of raw_bytes reading and the round-robin distribution strategy into target shards is efficient and helps ensure a balanced distribution of data.
  • The automatic adjustment of num_workers to be a factor of num_shards is a clever way to maintain deterministic shard ownership per worker and avoid write collisions.
  • Overall, the code is high quality and follows good practices for data processing scripts.
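
A minimal sketch of the two ideas called out above; the function names and the in-memory record list are illustrative, not the utility's actual implementation:

```python
def adjust_num_workers(num_workers: int, num_shards: int) -> int:
  """Shrink num_workers until it evenly divides num_shards, so each worker
  owns a fixed, disjoint set of output shards and never collides on writes."""
  while num_shards % num_workers != 0:
    num_workers -= 1
  return num_workers


def round_robin_assignment(num_records: int, num_shards: int) -> dict[int, list[int]]:
  """Assign record indices to target shards round-robin, keeping output
  shards balanced to within one record of each other."""
  assignment = {shard: [] for shard in range(num_shards)}
  for record_idx in range(num_records):
    assignment[record_idx % num_shards].append(record_idx)
  return assignment


# Example: 10 workers requested but 1024 target shards -> 8 workers, each
# owning 128 shards; 1,000,000 records spread evenly across the 1024 shards.
workers = adjust_num_workers(10, 1024)            # -> 8
shards = round_robin_assignment(1_000_000, 1024)
```

In practice the resharded split's metadata (e.g. the per-shard lengths recorded in dataset_info.json) also has to be regenerated so TFDS can read the new file layout, which the summary above notes the utility handles.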


@github-actions github-actions Bot left a comment


## 📋 Review Summary (Addendum)

Providing one additional piece of feedback that was missed in the initial submission regarding dataset naming conventions.

@RissyRan RissyRan force-pushed the eval_data_shard branch from 49319fb to 6167d94 on May 1, 2026 20:24
@RissyRan RissyRan force-pushed the eval_data_shard branch from 6167d94 to 6459378 on May 1, 2026 20:25
Collaborator

@shuningjin shuningjin left a comment


Approve to unblock.

@RissyRan RissyRan force-pushed the eval_data_shard branch from 6459378 to f41923b on May 1, 2026 21:39
@RissyRan RissyRan force-pushed the eval_data_shard branch from f41923b to 8caa3fc on May 1, 2026 21:58
@copybara-service copybara-service Bot merged commit 0d4af87 into main May 1, 2026
39 of 40 checks passed
@copybara-service copybara-service Bot deleted the eval_data_shard branch May 1, 2026 22:16