
Add utility to reshard TFDS dataset for large scale training #3789

Merged
copybara-service[bot] merged 1 commit into main from eval_data_shard on May 1, 2026

Conversation

Collaborator

@RissyRan RissyRan commented May 1, 2026

Description

Add a utility to reshard TFDS dataset for large scale training, b/508024540.

  • Eliminates overhead of dynamic re-sharding during training by pre-sharding data to match the host count.
  • Tests show a reduction in training loss (from 4.075 to 3.582). This is because the TFDS fallback logic (ds.shard) performs a "logical shard" that effectively reconstructs the un-shuffled sequential order of the original dataset across the global batch. By physically re-sharding into dedicated files, each host can stream contiguous records into its local shuffle buffer. This ensures that when host data is combined, the global batch contains a truly randomized distribution of data categories, leading to higher-quality gradients and better model performance.
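
The contrast between the two code paths can be sketched as follows. This is a minimal illustration only: the GCS paths, file layout, and TFRecord format are hypothetical placeholders, not the utility's actual interface.

```python
import jax
import tensorflow as tf

num_hosts = jax.process_count()   # e.g. 2048 dataloading hosts
host_id = jax.process_index()

# Fallback ("logical shard"): every host opens the same small file list and
# keeps every num_hosts-th record, which preserves the original sequential
# order of the dataset across the global batch.
all_files = tf.io.gfile.glob("gs://my-bucket/c4/en/3.0.5/*.tfrecord*")  # hypothetical path
logical = tf.data.TFRecordDataset(all_files).shard(num_shards=num_hosts, index=host_id)

# After physical resharding: each host opens only its own dedicated files and
# streams contiguous records straight into its local shuffle buffer, so the
# combined global batch is well mixed across data categories.
my_files = tf.io.gfile.glob(f"gs://my-bucket/c4_resharded/*-{host_id:05d}-of-*")  # hypothetical layout
physical = tf.data.TFRecordDataset(my_files).shuffle(buffer_size=10_000)
```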

Tests

Local conversation - example log

Before this change:

  • Warnings when loading both the training and eval datasets, and a large performance hit on every training step
  • End-to-end training logs from loading the real dataset; performance is 80-90 tokens/s/device per step link
/deps/src/maxtext/input_pipeline/tfds_data_processing.py:64: UserWarning: WARNING: Inefficient dataloading. Your c4/en:3.0.5 contains 128shards, smaller than dataloading_host_count=2048. This is known to lead to inefficient dataloading.

/deps/src/maxtext/input_pipeline/tfds_data_processing.py:64: UserWarning: WARNING: Inefficient dataloading. Your c4/en:3.0.5 contains 64shards, smaller than dataloading_host_count=2048. This is known to lead to inefficient dataloading
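
For reference, the condition that triggers this warning (fewer physical shards than dataloading hosts) can be checked up front. A minimal sketch, assuming the dataset is already prepared in the default data_dir and that the host count equals `jax.process_count()` (in MaxText the actual value comes from the config as `dataloading_host_count`):

```python
import jax
import tensorflow_datasets as tfds

builder = tfds.builder("c4/en:3.0.5")                # assumes the dataset is available in data_dir
num_shards = builder.info.splits["train"].num_shards  # physical shard count of the split
host_count = jax.process_count()                      # stand-in for dataloading_host_count

if num_shards < host_count:
  print(f"{num_shards} shards < {host_count} hosts: pre-shard the data to avoid inefficient dataloading")
```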

After this change:

  • We reshard the dataset based on the host count
  • No warnings found in the logs: link
  • End-to-end training logs from loading the real dataset; performance is 600-700 tokens/s/device per step link

For training loss:

  • before the change, initial loss: 5.819
  • after the change, initial loss: 5.816
  • observing training efficiency after re-sharding: the llm_loss at step 77 drops from 4.075 (before resharding) to 3.582 (after resharding). Both train and validation losses are aligned.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@RissyRan RissyRan changed the title from "Add a utility to reshard TFDS dataset for large scale training" to "Add utility to reshard TFDS dataset for large scale training" on May 1, 2026

github-actions Bot commented May 1, 2026

🤖 Hi @shuningjin, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This Pull Request introduces a valuable utility for resharding TFDS datasets, which is critical for maintaining efficient dataloading performance in large-scale training environments (e.g., matching shard count to host count). The implementation is well-structured, leverages multiprocessing effectively, and correctly updates the dataset metadata.

🔍 General Feedback

  • The use of raw_bytes reading and the round-robin distribution strategy into target shards is efficient and helps ensure a balanced distribution of data.
  • The automatic adjustment of num_workers to be a factor of num_shards is a clever way to maintain deterministic shard ownership per worker and avoid write collisions.
  • Overall, the code is high quality and follows good practices for data processing scripts.
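
A minimal sketch of the two ideas called out above; the function names and the in-memory record list are illustrative, not the utility's actual implementation:

```python
def adjust_num_workers(num_workers: int, num_shards: int) -> int:
  """Shrink num_workers until it evenly divides num_shards, so each worker
  owns a fixed, disjoint set of output shards and never collides on writes."""
  while num_shards % num_workers != 0:
    num_workers -= 1
  return num_workers


def round_robin_assignment(num_records: int, num_shards: int) -> dict[int, list[int]]:
  """Assign record indices to target shards round-robin, keeping output
  shards balanced to within one record of each other."""
  assignment = {shard: [] for shard in range(num_shards)}
  for record_idx in range(num_records):
    assignment[record_idx % num_shards].append(record_idx)
  return assignment


# Example: 10 workers requested but 1024 target shards -> 8 workers, each
# owning 128 shards; 1,000,000 records spread evenly across the 1024 shards.
workers = adjust_num_workers(10, 1024)            # -> 8
shards = round_robin_assignment(1_000_000, 1024)
```

In practice the resharded split's metadata (e.g. the per-shard lengths recorded in dataset_info.json) also has to be regenerated so TFDS can read the new file layout, which the summary above notes the utility handles.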


@github-actions github-actions Bot left a comment


## 📋 Review Summary (Addendum)

Providing one additional piece of feedback that was missed in the initial submission regarding dataset naming conventions.

@RissyRan RissyRan force-pushed the eval_data_shard branch from 49319fb to 6167d94 on May 1, 2026 20:24
@RissyRan RissyRan force-pushed the eval_data_shard branch from 6167d94 to 6459378 on May 1, 2026 20:25
Collaborator

@shuningjin shuningjin left a comment


Approve to unblock.

@RissyRan RissyRan force-pushed the eval_data_shard branch from 6459378 to f41923b on May 1, 2026 21:39
@RissyRan RissyRan force-pushed the eval_data_shard branch from f41923b to 8caa3fc on May 1, 2026 21:58
@copybara-service copybara-service Bot merged commit 0d4af87 into main May 1, 2026
39 of 40 checks passed
@copybara-service copybara-service Bot deleted the eval_data_shard branch May 1, 2026 22:16