Skip to content

[BUG] - Fix three stale-cache bugs after swarm particle addition#216

Open
bknight1 wants to merge 2 commits into
developmentfrom
bugfix/stale-caches-after-swarm-add
Open

[BUG] - Fix three stale-cache bugs after swarm particle addition#216
bknight1 wants to merge 2 commits into
developmentfrom
bugfix/stale-caches-after-swarm-add

Conversation

@bknight1
Copy link
Copy Markdown
Member

Summary

Fixes three stale-cache bugs that occur when a swarm is modified via add_particles_with_coordinates() and then used for interpolation or projection. These affect the common pattern of re-adding particles to empty cells each timestep.

Fixes #215

Bug 1 — Stale kd-tree in add_particles_with_coordinates

add_particles_with_coordinates() calls self.dm.migrate() directly, bypassing Swarm.migrate() which calls _invalidate_canonical_data(). The manual invalidation in add_particles_with_coordinates set _canonical_data = None for coordinates and variables but missed self._kdtree. After particle addition, swarm._get_kdtree() returned a kd-tree built from old coordinates.

Fix: Added self._kdtree = None in the invalidation block in add_particles_with_coordinates.

Bug 2 — Stale cached projector in _project_to_work_variable

_project_to_work_variable() caches Projection solver instances on the mesh object (_eval_projector_scalar and _eval_{shape}_projector). When reused across evaluate() calls after a Stokes solve modifies the DM, the cached projectors PETSc solver state is stale. On PETSc 3.22.2 (Setonix HPC) this causes an MPI deadlock; on PETSc 3.24.2 (macOS) it silently returns stale results.

Fix: Added _force_setup=True to both the scalar (line 640) and tensor (line 613) projector.solve() calls.

Bug 3 — Stale proxy mesh variable after swarm write (documented, no fix)

The lazy proxy update pattern marks the proxy as stale on data write (_proxy_stale = True), but the actual re-interpolation (_rbf_to_meshVar) only fires when material.sym is accessed or _update_proxy_if_stale() is called. Code that reads the proxy MeshVariable DM directly (e.g. a Projection solver evaluating its uw_function at quadrature points) reads stale data.

Workaround: Call material._update_proxy_if_stale() (or access material.sym) after modifying a swarm variable and before using its proxy mesh variable.

Test

Added tests/test_0112_swarm_add_particles.py with test_proxy_updates_after_add_particles that verifies the proxy mesh variable correctly reflects swarm changes after adding particles.

Platform-dependent behavior

See issue #215 comment for details. PETSc 3.22.2 deadlocks on these bugs; 3.24.2 silently returns wrong values.

Underworld development team with AI support from Claude Code

Bug 1 — Stale kd-tree in add_particles_with_coordinates:
  add_particles_with_coordinates() uses self.dm.migrate() directly,
  bypassing Swarm.migrate() which calls _invalidate_canonical_data().
  The manual invalidation at lines 3574-3577 nils _canonical_data but
  missed self._kdtree, causing RBF interpolation to use old particle
  coordinates/neighbors.

Bug 2 — Stale cached projector in _project_to_work_variable:
  Cached Projection solvers on the mesh object were reused across
  evaluate() calls without _force_setup=True. After a Stokes solve
  modifies the DM, the cached projector's PETSc solver state is stale,
  causing deadlock on PETSc 3.22.2 and silent wrong results on 3.24.2.

Bug 3 — Stale proxy mesh variable after swarm write (documented only):
  _update() marks proxy as stale, but the lazy re-interpolation
  (_update_proxy_if_stale()) only fires on material.sym access.
  Code reading the proxy MeshVariable DM directly (e.g. a Projection
  solver evaluating its uw_function) gets stale data.

See GitHub issue #215.

Underworld development team with AI support from Claude Code
Copilot AI review requested due to automatic review settings May 29, 2026 02:21
@bknight1 bknight1 requested a review from lmoresi as a code owner May 29, 2026 02:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two of three stale-cache bugs that occur after Swarm.add_particles_with_coordinates() is called: a stale kd-tree in the swarm, and a stale cached Projection solver used by uw.function.evaluate() for derivative/L2 expressions. A third bug (stale proxy MeshVariable DM after swarm writes) is documented via a TODO comment but not fixed; a workaround is recommended in the PR description. A new regression/stress test file is added.

Changes:

  • Add self._kdtree = None invalidation in Swarm.add_particles_with_coordinates() so subsequent _get_kdtree() rebuilds against the new coordinates.
  • Force _force_setup=True on the scalar and tensor cached projectors in _project_to_work_variable(), rebuilding SNES/DM state every call.
  • Add tests/test_0112_swarm_add_particles.py covering add-after-populate, proxy updates, MPI re-add cycles, kd-tree + save patterns, and a Stokes/projection/evaluate/write loop.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/underworld3/swarm.py Adds kd-tree cache invalidation after the direct dm.migrate() in add_particles_with_coordinates, plus a TODO comment documenting the unfixed stale-proxy case.
src/underworld3/function/_function.pyx Passes _force_setup=True to both cached Projection/MultiComponent_Projection solves to avoid stale solver state.
tests/test_0112_swarm_add_particles.py New regression and stress tests for swarm particle addition, proxy updates, MPI patterns, and a Stokes+projection+evaluate cycle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +613 to +615
# _force_setup=True: rebuild solver state to avoid stale cached
# projector after Stokes/DM modifications (issue #215, Bug 2).
projector.solve(zero_init_guess=False, _force_setup=True)
Comment thread src/underworld3/swarm.py Outdated
Comment on lines 3579 to 3584
# Invalidate cached data — particle count changed after addNPoints + migrate
self._particle_coordinates._canonical_data = None
self._kdtree = None # issue #215, Bug 1: stale kd-tree after add_particles_with_coordinates
for var in self._vars.values():
if hasattr(var, "_canonical_data"):
var._canonical_data = None
Comment thread src/underworld3/swarm.py Outdated
Comment on lines 3577 to 3584
@@ -3572,6 +3578,7 @@ def add_particles_with_coordinates(self, coordinatesArray) -> int:

# Invalidate cached data — particle count changed after addNPoints + migrate
self._particle_coordinates._canonical_data = None
self._kdtree = None # issue #215, Bug 1: stale kd-tree after add_particles_with_coordinates
for var in self._vars.values():
if hasattr(var, "_canonical_data"):
var._canonical_data = None
Comment on lines +242 to +244
with swarm.access(material):
if n_added > 0:
material.data[old_size:old_size + n_added, 0] = material_init[re_add_mask, 0]
Comment on lines +614 to +615
_log = []
dbg = uw.mpi.rank == 0 and (lambda s: (_log.append(s), print(s, flush=True))[-1]) or (lambda s: None)
Replace three manual canonical-data invalidation blocks with calls to
_invalidate_canonical_data(), which now also marks swarm variable
proxies as stale via _update(). This ensures the proxy mesh variable
is re-interpolated after migration, particle addition, and snapshot
restore.

Affected callers:
- Swarm.migrate() — already used _invalidate_canonical_data
- add_particles_with_coordinates — replaced manual block
- apply_snapshot_payload — replaced manual block
- disk_snapshot.py — replaced manual block

Underworld development team with AI support from Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants