Skip to content

fix: preserve system IDs during inflight refill#113

Merged
laserkelvin merged 4 commits into
NVIDIA:mainfrom
architdatar:fix/112-refill-system-id
Jun 25, 2026
Merged

fix: preserve system IDs during inflight refill#113
laserkelvin merged 4 commits into
NVIDIA:mainfrom
architdatar:fix/112-refill-system-id

Conversation

@architdatar

@architdatar architdatar commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Preserve sampler-assigned system_id values for replacement systems during inflight refill. BaseDynamics.refill_check() now seeds rebuilt bookkeeping tensors from values already present on the appended result batch before restoring values for systems that stayed active.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Fixes #112

Changes Made

  • Preserve system_id values assigned by SizeAwareSampler during refill_check() for inflight batching.
  • Preserve system_id values for systems already in batch.
  • Add test to verify this workflow.

Testing

  • Unit tests pass locally (uv run pytest test/dynamics/test_inflight.py::TestRefillCheck)
  • Linting passes (make lint)
  • New tests added for new functionality meets coverage expectations?
  • Added regression test for system_id preservation during inflight refill.

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

This is a Python-only toolkit fix. No nvalchemi-toolkit-ops changes are required.

Signed-off-by: architd <architd@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@architdatar architdatar marked this pull request as ready for review June 15, 2026 22:16
@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where sampler-assigned system_id values (and any other bookkeeping keys) for replacement systems were silently discarded during BaseDynamics.refill_check(), replaced by default values from default_fn. The fix seeds new_tensor from the already-assembled result batch before overwriting the remaining-systems prefix.

  • nvalchemi/dynamics/base.py: In the bookkeeping loop, result_vals are now copied into new_tensor first (preserving replacement values from the appended batch), then new_tensor[:n_remaining] is overwritten with values from the original remaining systems. A RuntimeError is raised on shape mismatch instead of silently no-oping.
  • test/dynamics/test_inflight.py: Five new regression tests validate system_id and status preservation across all meaningful refill scenarios (full replacement, mixed active/replaced, multi-stage, multiple simultaneous replacements, dataset-provided nonzero status).

Important Files Changed

Filename Overview
nvalchemi/dynamics/base.py Adds a seed-then-overwrite step in the bookkeeping loop: result values (including sampler-assigned system_id for replacements) are copied into new_tensor before the remaining-systems prefix is restored. Shape mismatches now raise RuntimeError instead of silently skipping.
test/dynamics/test_inflight.py Adds five regression tests covering: full replacement, mixed remaining+replacement, multi-stage statuses, multiple simultaneous replacements, and dataset-provided nonzero status preservation — all verifying system_id and status integrity after refill_check.

Reviews (4): Last reviewed commit: "test: cover refill with complex status b..." | Re-trigger Greptile

Comment thread nvalchemi/dynamics/base.py
Comment thread test/dynamics/test_inflight.py
Signed-off-by: architd <architd@nvidia.com>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add another test case where not all samples are graduating?

Essentially, we want to make sure that the refill case where only some slots are being replaced that the system ids are also being preserved then

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I believe that this case is covered in the latest commit (dbcef64) by test_refill_preserves_mixed_system_ids.

That test starts with initial system_ids [0, 1], marks only the first sample as graduated with status = [[1], [0]], then verifies the refill result preserves the remaining system ID and the replacement ID as [1, 2].

Please let me know if you have a different partial-refill scenario in mind.

@laserkelvin laserkelvin Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we only cover the case where the exit and entry status codes are the default cases; you could add one or two unit test cases to make sure that this is working as intended when they're not the default 0 and 1 values

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I added a new commit, 4e7d30a (test: cover refill with complex status bookkeeping), with three additional refill regression tests that cover the non-default / more complex status cases you pointed out.

The new tests cover:

  • test_refill_preserves_system_ids_with_multistage_statuses: a multistage case with statuses [0, 1, 2] and exit_status=2 (as opposed to the standard 1), where only the status-2 system graduates. This verifies the remaining lower-status systems keep their system_ids while the replacement gets the next sampler-assigned ID.

  • test_refill_preserves_system_ids_with_multiple_replacements: a case where two systems graduate in the same refill call. This verifies that the remaining system keeps its ID and both replacement systems keep their newly assigned sampler IDs.

  • test_refill_preserves_replacement_status_from_dataset: a case where replacement samples already carry nonzero graph-level status from the dataset; i.e., non-standard entry-points. This verifies that refill_check() preserves bookkeeping already present on appended replacement samples instead of forcing replacement status back to the default.

Together these should cover partial refill with non-default/multistage statuses, simultaneous replacements, and replacement-carried bookkeeping values.

@laserkelvin

Copy link
Copy Markdown
Collaborator

/ok to test f7b0128

Add refill coverage for a multistage-like status case where only the status-2 system is replaced while lower-status systems keep their IDs.
Add coverage for simultaneous replacement of multiple graduated systems, preserving the remaining system ID and assigning new sampler IDs to both replacements.
Add coverage for replacement-carried bookkeeping by verifying a dataset-provided nonzero status survives refill.

Signed-off-by: architd <architd@nvidia.com>
@laserkelvin

Copy link
Copy Markdown
Collaborator

/ok to test 4e7d30a

@laserkelvin laserkelvin enabled auto-merge June 25, 2026 15:24
@laserkelvin laserkelvin added this pull request to the merge queue Jun 25, 2026
Merged via the queue into NVIDIA:main with commit 35f4530 Jun 25, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛[BUG]: BaseDynamics.refill_check overwrites replacement system_id during inflight refill

2 participants