Summary
Three independent stale-cache issues were found when a swarm is modified via add_particles_with_coordinates() and then used for interpolation or projection. All three cause silent data corruption or MPI deadlock (PETSc < 3.24) and affect the common pattern of re-adding particles to empty cells each timestep.
Bug 1 — Stale kd-tree in add_particles_with_coordinates
File: src/underworld3/swarm.py:3571-3577
add_particles_with_coordinates() calls self.dm.migrate() (the raw PETSc DMSwarm migration) rather than self.migrate() (the UW3 wrapper). The manual cache invalidation at lines 3574-3577 resets _particle_coordinates._canonical_data and each variable's _canonical_data to None, but misses self._kdtree:
# Lines 3574-3577 (BEFORE fix)
self._particle_coordinates._canonical_data = None
for var in self._vars.values():
    if hasattr(var, "_canonical_data"):
        var._canonical_data = None
# missing: self._kdtree = None
By contrast, Swarm._invalidate_canonical_data() at line 2648 correctly sets self._kdtree = None. It is called by self.migrate() (line 3463), but add_particles_with_coordinates bypasses that path.
Effect: After adding particles, swarm._get_kdtree() returns a kd-tree built from OLD particle coordinates. RBF interpolation (both for proxy mesh variables and for uw.function.evaluate() with rbf=True) looks up particle indices from the stale tree, accessing wrong PETSc memory locations. On PETSc 3.22.2 this produces an MPI deadlock inside the kd-tree query → rbf_evaluate → update_lvec path; on PETSc 3.24.2 it silently returns wrong interpolated values.
Fix applied: Added self._kdtree = None after line 3573.
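The failure mode can be illustrated outside UW3. The following is a minimal, self-contained sketch, not UW3 code: ToySwarm, its brute-force lookup, and the invalidate flag are hypothetical stand-ins, with a cached list of candidate indices playing the role of self._kdtree.

```python
class ToySwarm:
    """Hypothetical stand-in for a swarm (not the UW3 Swarm class)."""

    def __init__(self, coords):
        self._coords = list(coords)
        self._index = None  # cached lookup structure, analogous to _kdtree

    def _invalidate_caches(self):
        # Single place that resets every cache derived from the coordinates.
        self._index = None

    def _get_index(self):
        # Lazily build the lookup structure from the current coordinates.
        if self._index is None:
            self._index = list(range(len(self._coords)))
        return self._index

    def add_particles(self, new_coords, invalidate=True):
        self._coords.extend(new_coords)
        if invalidate:
            self._invalidate_caches()

    def nearest(self, x):
        # Only particles known to the cached index are candidates.
        return min(self._get_index(), key=lambda i: abs(self._coords[i] - x))


buggy = ToySwarm([0.0, 1.0])
buggy.nearest(0.4)                            # builds the cache
buggy.add_particles([0.5], invalidate=False)  # cache left stale
stale_hit = buggy.nearest(0.4)                # 0: the new particle is invisible

fixed = ToySwarm([0.0, 1.0])
fixed.nearest(0.4)
fixed.add_particles([0.5])                    # cache reset, rebuilt on demand
fresh_hit = fixed.nearest(0.4)                # 2: the new particle is found
```

Routing every mutation through one invalidation routine, as self.migrate() does via _invalidate_canonical_data(), avoids having to keep several hand-written reset lists in sync.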
Bug 2 — Stale cached projector in _project_to_work_variable
File: src/underworld3/function/_function.pyx:529-642
_project_to_work_variable() caches Projection solver instances on the mesh object as _eval_projector_scalar (scalar) or _eval_{shape}_projector (tensor). The solver is created once and reused across all subsequent evaluate() calls on that mesh:
if not hasattr(mesh, "_eval_projector_scalar"):
    mesh._eval_projector_scalar = uw.systems.Projection(mesh, ...)
projector = mesh._eval_projector_scalar
projector.uw_function = scalar_expr
projector.solve(zero_init_guess=False)  # no _force_setup
When a Stokes solve (or any other solver that modifies the DM) runs between two evaluate() calls, the cached projector's PETSc solver state (SNES/KSP/matrix decomposition) is stale. On PETSc 3.22.2, projector.solve() deadlocks because the cached matrix doesn't match the current DM state. PETSc 3.24.2 tolerates the mismatch but silently returns wrong results.
Fix applied: Changed both the scalar projector (line 640) and the tensor projector (line 613) to pass _force_setup=True:
projector.solve(zero_init_guess=False, _force_setup=True)
Note: The same stale-cached-projector pattern exists in user code that reuses Projection solver instances across timesteps or after Stokes solves. Any cached projection solver should either (a) pass _force_setup=True on every solve, or (b) track a DM version counter and auto-rebuild when the DM changes.
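Option (b) could look roughly like the sketch below. It is an illustration only: ToyMesh, CachedSolver, the version attribute, and mark_modified() are hypothetical names, and nothing here is an existing UW3 or PETSc API. The idea is that the solver records which mesh version its setup was built for and rebuilds only when that version has changed.

```python
class ToyMesh:
    """Hypothetical mesh carrying a version counter (not a UW3 class)."""

    def __init__(self):
        self.version = 0

    def mark_modified(self):
        # Any operation that changes the underlying DM would bump this.
        self.version += 1


class CachedSolver:
    """Hypothetical solver that rebuilds its setup when the mesh changes."""

    def __init__(self, mesh):
        self.mesh = mesh
        self._built_for_version = None
        self.setup_count = 0

    def _setup(self):
        # Stand-in for the expensive solver/matrix construction.
        self.setup_count += 1
        self._built_for_version = self.mesh.version

    def solve(self):
        # Rebuild only if the mesh changed since the last setup.
        if self._built_for_version != self.mesh.version:
            self._setup()
        return self._built_for_version


mesh = ToyMesh()
solver = CachedSolver(mesh)
solver.solve()
solver.solve()
setups_before = solver.setup_count  # 1: second solve reused the setup
mesh.mark_modified()                # e.g. another solver touched the DM
solver.solve()
setups_after = solver.setup_count   # 2: setup rebuilt exactly once
```

Compared with passing _force_setup=True on every solve, this keeps the cache useful when nothing has changed, at the cost of requiring every DM-modifying operation to bump the counter.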
Bug 3 — Stale proxy mesh variable data after swarm write
File: src/underworld3/swarm.py:1034-1087 (proxy update pipeline)
When a SwarmVariable has proxy_degree > 0 (the default is proxy_degree=2), a proxy MeshVariable is created that stores RBF-interpolated values from the swarm. The update is lazy:
- swarm.access(var) modifies the canonical data array
- On exit, delay_callbacks_global fires the data callback
- The callback calls pack_raw_data_to_petsc() (line 478), which writes to PETSc and calls self._update() (line 1291), setting self._proxy_stale = True
- The actual re-interpolation (_rbf_to_meshVar) happens only when material.sym is accessed or _update_proxy_if_stale() is called
The problem: If code reads the proxy's MeshVariable DM directly (e.g., a Projection solver that evaluates its uw_function at quadrature points), it reads STALE data from the proxy's PETSc DM because the lazy update hasn't fired yet.
Concrete scenario:
material = swarm.add_variable("material", 1, dtype=int, proxy_degree=2)
meshMat.uw_function = material.sym[0] # triggers proxy update, stores symbol
# ... add particles and set new material values ...
meshMat.solve(_force_setup=True)
# ^ evaluates stored proxy symbol at quadrature points
# ^ proxy DM still contains data from the FIRST sym access — STALE
Why uw.function.evaluate(material.sym[0], ...) works: It re-accesses material.sym, which calls _update_proxy_if_stale() and re-interpolates from the current swarm.
Fix needed: Either:
- (a) Document that _update_proxy_if_stale() must be called before using the proxy MeshVariable DM after a swarm write
- (b) Make the evaluation pipeline check for stale proxies and auto-update before reading
- (c) Remove the lazy proxy update pattern and update immediately on data write
- (d) Add proxy update hooks in add_particles_with_coordinates and other swarm-mutating methods
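The difference between the stale direct read and option (b) can be sketched with a toy lazy proxy. Everything in this block is hypothetical (ToyProxy, read_raw(), read_checked(), and the use of a simple mean in place of RBF interpolation); it mirrors the dirty-flag pattern described above rather than the actual UW3 implementation.

```python
class ToyProxy:
    """Hypothetical lazy proxy; the mean stands in for RBF interpolation."""

    def __init__(self, particle_values):
        self._particles = particle_values  # shared, mutable particle data
        self._mesh_data = None             # analogous to the proxy's DM data
        self._stale = True

    def mark_stale(self):
        # Called after a swarm write, like setting _proxy_stale = True.
        self._stale = True

    def update_if_stale(self):
        if self._stale:
            self._mesh_data = sum(self._particles) / len(self._particles)
            self._stale = False

    @property
    def sym(self):
        # Symbol access refreshes the proxy first.
        self.update_if_stale()
        return self._mesh_data

    def read_raw(self):
        # Direct read with no staleness check (the problematic path).
        return self._mesh_data

    def read_checked(self):
        # Option (b): check and refresh before every read.
        self.update_if_stale()
        return self._mesh_data


particles = [1.0, 1.0]
proxy = ToyProxy(particles)
first = proxy.sym               # 1.0, computed on first access
particles.extend([3.0, 3.0])    # swarm write: new particles, new values
proxy.mark_stale()
raw = proxy.read_raw()          # 1.0, still the data from the first access
checked = proxy.read_checked()  # 2.0, refreshed from the current particles
```

In this sketch the check costs one flag comparison per read, so guarding the read path is cheap relative to the interpolation it occasionally triggers.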
Reproduction
The test file tests/test_0112_swarm_add_particles.py contains test_proxy_updates_after_add_particles which reproduces Bug 1 (kd-tree) and Bug 3 (proxy staleness). Bug 2 was reproduced on Setonix HPC (PETSc 3.22.2) and confirmed locally on macOS (PETSc 3.24.2).
Environment
- PETSc 3.22.2 (Setonix HPC): deadlocks on Bugs 1 and 2
- PETSc 3.24.2 (macOS): silently returns wrong values on Bugs 1 and 2
- Underworld3 development branch (as of 2026-05-29)
Related Files
src/underworld3/swarm.py — lines 3501, 3571-3577 (Bug 1), lines 1034-1087 (Bug 3)
src/underworld3/function/_function.pyx — lines 529-642 (Bug 2)